CIS550 Group Project

Screen Shot 2022-10-11 at 1.42.03 PM.png

Hello Oeda Platform Users, Oeda is a user-interactive Platform for analyzing Brazilian E-commerce trends. \ By creating a user-friendly, intuitive tool for e-commerce, we aim to make it easier for merchants to understand and analyze the market and make data-driven decisions to improve their sales and meet consumer needs. \ This part of the project aims to clean and use Brazilian E-commerce data for analytics purposes.

Data cleaning is identifying and correcting errors, inconsistencies, and missing data in a dataset. It is a critical step in data preparation, ensuring that the data is correct, consistent, and suitable for analysis. \ The data cleaning process typically involves several steps. First, the data is examined to identify any errors or inconsistencies. This process may involve visual inspection of the data or using algorithms and statistical techniques to detect abnormalities. Once these errors and inconsistencies have been identified, they must be corrected or removed. These problems may involve manually fixing individual records or using algorithms to correct or impute missing data automatically. Finally, the cleaned datasets are validated to ensure it is correct and ready for analysis.

Imports and Read Data from Files

Setup

At the very first step, we want to set up the necessary packages needed for the data cleaning processes. We would run the following commandsTo install the pandas and sql packages using pip.

This would download and install the pandas and sql packages and any other necessary dependencies. Once the packages are installed, we can later use them in our Python programs by importing in the next step.

The pandas package provides tools for working with data in Python, including functions for reading, writing, and manipulating data in various formats. The sql package provides tools for working with SQL databases in Python, including functions for executing SQL queries and managing database connections. Together, these packages can be used to perform data analysis and manipulation tasks involving both tabular data and SQL databases.

In [ ]:
!pip install pandas==1.1.5
!pip install pandasql
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: pandas==1.1.5 in /usr/local/lib/python3.8/dist-packages (1.1.5)
Requirement already satisfied: numpy>=1.15.4 in /usr/local/lib/python3.8/dist-packages (from pandas==1.1.5) (1.21.6)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.8/dist-packages (from pandas==1.1.5) (2022.6)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.8/dist-packages (from pandas==1.1.5) (2.8.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.8/dist-packages (from python-dateutil>=2.7.3->pandas==1.1.5) (1.15.0)
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: pandasql in /usr/local/lib/python3.8/dist-packages (0.7.3)
Requirement already satisfied: sqlalchemy in /usr/local/lib/python3.8/dist-packages (from pandasql) (1.4.44)
Requirement already satisfied: pandas in /usr/local/lib/python3.8/dist-packages (from pandasql) (1.1.5)
Requirement already satisfied: numpy in /usr/local/lib/python3.8/dist-packages (from pandasql) (1.21.6)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.8/dist-packages (from pandas->pandasql) (2022.6)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.8/dist-packages (from pandas->pandasql) (2.8.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.8/dist-packages (from python-dateutil>=2.7.3->pandas->pandasql) (1.15.0)
Requirement already satisfied: greenlet!=0.4.17 in /usr/local/lib/python3.8/dist-packages (from sqlalchemy->pandasql) (2.0.1)

In this part, we import various packages that will be useful for our data import process. We mainly use pandas to process the data. These packages will be beneficial for future data-cleaning strategies.

There are many advantages to importing packages like numpy, datetime, random, and sklearn in our Python programs. Some of these advantages include the following:

  • Functionality: These packages provide a wide range of functions and tools that can be used to perform various tasks in Python, such as working with arrays, dates and times, random numbers, and machine learning algorithms. These functions allow us to write more powerful and flexible programs to solve various problems.

  • Efficiency: The functions and tools provided by these packages are often highly optimized, which means they can be executed more quickly and efficiently than if we were to implement the same functionality from scratch. This benefit can save time and resources and make our programs run faster.

  • Consistency: By using these packages, we can ensure that our code is consistent with established best practices and standards. This feature can make our code easier to read and understand and help prevent errors and bugs.

  • Community: These packages are widely used and well-supported, which means that a large community of users and developers can provide help and support if we have any questions or issues. This trend can make learning and using these packages easier, giving us access to knowledge and resources.

Overall, importing packages like numpy, datetime, random, and sklearn can help us write more powerful, efficient, and consistent Python programs.

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import re
import nltk
nltk.download('punkt')
from nltk.corpus import stopwords
from wordcloud import WordCloud
import matplotlib.pyplot as plt 
from collections import Counter
import random
from random import sample
import sklearn
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

Import Data

We find two datasets that are useful for the project. One of them is the Brazilian E-commerce Dataset which has information on the orders made at the Olist store. The dataset has detailed order information from 2016 to 2018. Here we aim to use pandas to transfer data from CSV form to the data tables. We can then use the tables for future cleaning processes.

To import data from Google Drive using the read_csv method in the pandas' package, we first ensure that we have installed the pandas' package and authorized our Python script to access our Google Drive account.

Then we can use the read_csv method to read a CSV file from our Google Drive account into a pandas data frame. We use this method to clean multiple data sets.

We can then use the various methods and functions of the pandas' package to manipulate and analyze the data.

Read data from Drive

Note: cleaned data

In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [ ]:
items_data = pd.read_csv('/content/drive/My Drive/olist_order_items_dataset.csv')
product_data = pd.read_csv('/content/drive/My Drive/olist_products_dataset.csv')
city_data = pd.read_csv('/content/drive/My Drive/BRAZIL_CITIES.csv')
customer_data = pd.read_csv('/content/drive/My Drive/olist_customers_dataset.csv')  
geolocation_data = pd.read_csv('/content/drive/My Drive/olist_geolocation_dataset.csv')
payment_data = pd.read_csv('/content/drive/My Drive/olist_order_payments_dataset.csv')
order_data = pd.read_csv('/content/drive/My Drive/olist_orders_dataset.csv')
category_data = pd.read_csv('/content/drive/My Drive/product_category_name_translation.csv')
seller_data =  pd.read_csv('/content/drive/My Drive/olist_sellers_dataset.csv')
review_data =  pd.read_csv('/content/drive/My Drive/olist_order_reviews_dataset.csv')

# items_data = pd.read_csv('/content/drive/MyDrive/items.csv')
# product_data = pd.read_csv('/content/drive/MyDrive/product.csv')
# city_data = pd.read_csv('/content/drive/MyDrive/city.csv')
# customer_data = pd.read_csv('/content/drive/MyDrive/customer.csv')  
# geolocation_data = pd.read_csv('/content/drive/MyDrive/geolocation.csv')
# payment_data = pd.read_csv('/content/drive/MyDrive/payment.csv')
# order_data = pd.read_csv('/content/drive/MyDrive/order.csv')
# category_data = pd.read_csv('/content/drive/MyDrive/product_category_name_translation.csv')
# seller_data =  pd.read_csv('/content/drive/MyDrive/seller.csv')
# review_data =  pd.read_csv('/content/drive/MyDrive/review.csv')
#pd.set_option('max_columns', None)

Read data from uploaded files

Note: original data (before cleaning)

We use read_csv() again here to read CSV files containing data about items, products, cities, customers, geolocations, payments, orders, categories, sellers, and reviews. We could use the read_csv method to read this data into a pandas data frame. We can then use it to access and manipulate the data in various ways.

In [ ]:
items_data = pd.read_csv('olist_order_items_dataset.csv')
product_data = pd.read_csv('olist_products_dataset.csv')
city_data = pd.read_csv('BRAZIL_CITIES.csv')
customer_data = pd.read_csv('olist_customers_dataset.csv')
geolocation_data = pd.read_csv('olist_geolocation_dataset.csv')
payment_data = pd.read_csv('olist_order_payments_dataset.csv')
order_data = pd.read_csv('olist_orders_dataset.csv')
category_data = pd.read_csv('product_category_name_translation.csv')
seller_data =  pd.read_csv('olist_sellers_dataset.csv')
review_data =  pd.read_csv('olist_order_reviews_dataset.csv')
pd.set_option('max_columns', None)

Overview of Datasets

Item

The items dataset in the Brazilian e-commerce dataset contains information about the items sold on a Brazilian e-commerce platform. This dataset includes the order number, seller name, item price, and item weight for each sold item. We find the items data from the Brazilian E-commerce dataset very useful because it contains information about a specific item in order. From items_data, we can learn about each item's associated order, product, seller, price, and freight value.

The order number is a unique identifier for each order placed on the platform. The seller's name is the seller's name who sold the item. The item price is the price of the item in Brazilian reais. And the item weight is the weight of the item in kilograms.

This information can be used to understand the characteristics of the items sold on the platform, such as the prices and weights of the items and the sellers who sold them. It can also be used to analyze trends and patterns in the sales data, such as which sellers had the most sales or which items were the most popular.

In [ ]:
items_data
Out[ ]:
order_id order_item_id product_id seller_id shipping_limit_date price freight_value
0 00010242fe8c5a6d1ba2dd792cb16214 1 4244733e06e7ecb4970a6e2683c13e61 48436dade18ac8b2bce089ec2a041202 2017-09-19 09:45:35 58.90 13.29
1 00018f77f2f0320c557190d7a144bdd3 1 e5f2d52b802189ee658865ca93d83a8f dd7ddc04e1b6c2c614352b383efe2d36 2017-05-03 11:05:13 239.90 19.93
2 000229ec398224ef6ca0657da4fc703e 1 c777355d18b72b67abbeef9df44fd0fd 5b51032eddd242adc84c38acab88f23d 2018-01-18 14:48:30 199.00 17.87
3 00024acbcdf0a6daa1e931b038114c75 1 7634da152a4610f1595efa32f14722fc 9d7a1d34a5052409006425275ba1c2b4 2018-08-15 10:10:18 12.99 12.79
4 00042b26cf59d7ce69dfabb4e55b4fd9 1 ac6c3623068f30de03045865e4e10089 df560393f3a51e74553ab94004ba5c87 2017-02-13 13:57:51 199.90 18.14
... ... ... ... ... ... ... ...
112645 fffc94f6ce00a00581880bf54a75a037 1 4aa6014eceb682077f9dc4bffebc05b0 b8bc237ba3788b23da09c0f1f3a3288c 2018-05-02 04:11:01 299.99 43.41
112646 fffcd46ef2263f404302a634eb57f7eb 1 32e07fd915822b0765e448c4dd74c828 f3c38ab652836d21de61fb8314b69182 2018-07-20 04:31:48 350.00 36.53
112647 fffce4705a9662cd70adb13d4a31832d 1 72a30483855e2eafc67aee5dc2560482 c3cfdc648177fdbbbb35635a37472c53 2017-10-30 17:14:25 99.90 16.95
112648 fffe18544ffabc95dfada21779c9644f 1 9c422a519119dcad7575db5af1ba540e 2b3e4a2a3ea8e01938cabda2a3e5cc79 2017-08-21 00:04:32 55.99 8.72
112649 fffe41c64501cc87c801fd61db3f6244 1 350688d9dc1e75ff97be326363655e01 f7ccf836d21b2fb1de37564105216cc1 2018-06-12 17:10:13 43.00 12.79

112650 rows × 7 columns

In [ ]:
items_data.describe()
Out[ ]:
order_item_id price freight_value
count 112650.000000 112650.000000 112650.000000
mean 1.197834 120.653739 19.990320
std 0.705124 183.633928 15.806405
min 1.000000 0.850000 0.000000
25% 1.000000 39.900000 13.080000
50% 1.000000 74.990000 16.260000
75% 1.000000 134.900000 21.150000
max 21.000000 6735.000000 409.680000

Product

The product table in the Brazilian e-commerce dataset contains detailed information about each product sold on the platform. This information includes the category of the product, its volume dimensions, and its weight.

Having this information in a separate table can be helpful for various purposes. For example, we can use it to understand the characteristics of the products sold on the platform, such as their categories, dimensions, and weights. This info can help analyze trends and patterns in the sales data, such as which types of products were the most popular or which products had the most significant volumes.

In addition, having this information in a separate table can make it easier to perform queries and explorations on the sales data. For example, we could use SQL queries to join the product table with other tables, such as the order table, to find out which products were ordered the most or to analyze the sales data by product category. This method can help us gain a more detailed and nuanced understanding of the sales data and support more advanced analyses and insights.

In [ ]:
product_data
Out[ ]:
product_id product_category_name product_name_lenght product_description_lenght product_photos_qty product_weight_g product_length_cm product_height_cm product_width_cm
0 1e9e8ef04dbcff4541ed26657ea517e5 perfumaria 40.0 287.0 1.0 225.0 16.0 10.0 14.0
1 3aa071139cb16b67ca9e5dea641aaa2f artes 44.0 276.0 1.0 1000.0 30.0 18.0 20.0
2 96bd76ec8810374ed1b65e291975717f esporte_lazer 46.0 250.0 1.0 154.0 18.0 9.0 15.0
3 cef67bcfe19066a932b7673e239eb23d bebes 27.0 261.0 1.0 371.0 26.0 4.0 26.0
4 9dc1a7de274444849c219cff195d0b71 utilidades_domesticas 37.0 402.0 4.0 625.0 20.0 17.0 13.0
... ... ... ... ... ... ... ... ... ...
32946 a0b7d5a992ccda646f2d34e418fff5a0 moveis_decoracao 45.0 67.0 2.0 12300.0 40.0 40.0 40.0
32947 bf4538d88321d0fd4412a93c974510e6 construcao_ferramentas_iluminacao 41.0 971.0 1.0 1700.0 16.0 19.0 16.0
32948 9a7c6041fa9592d9d9ef6cfe62a71f8c cama_mesa_banho 50.0 799.0 1.0 1400.0 27.0 7.0 27.0
32949 83808703fc0706a22e264b9d75f04a2e informatica_acessorios 60.0 156.0 2.0 700.0 31.0 13.0 20.0
32950 106392145fca363410d287a815be6de4 cama_mesa_banho 58.0 309.0 1.0 2083.0 12.0 2.0 7.0

32951 rows × 9 columns

City

The City dataset in the Brazilian Cities dataset contains information about the demographics of Brazilian municipalities. This information includes each municipality's city name, state, location, taxes, GDP, and population.

This information in the dataset can help study geographic insights about sales and marketing. For example, we can use this information to understand the demographics of the municipalities where the sales were made, such as their population size, GDP, and taxes. This info can help analyze trends and patterns in the sales data, such as which municipalities had the most sales or the highest GDP.

In addition, having this information in the dataset can make it easier to perform queries and explorations on the sales data. For example, we could use SQL queries to join the City dataset with other tables, such as the order table, to find out which municipalities had the most sales or to analyze the sales data by city or state.

In [ ]:
city_data
Out[ ]:
CITY STATE CAPITAL IBGE_RES_POP IBGE_RES_POP_BRAS IBGE_RES_POP_ESTR IBGE_1 IBGE_1-4 IBGE_5-9 IBGE_10-14 IBGE_15-59 IBGE_60+ IDHM Ranking 2010 IDHM IDHM_Renda IDHM_Longevidade IDHM_Educacao LONG LAT ALT PAY_TV FIXED_PHONES AREA REGIAO_TUR CATEGORIA_TUR ESTIMATED_POP RURAL_URBAN GVA_TOTAL TAXES GDP POP_GDP GDP_CAPITA COMP_TOT COMP_A COMP_B COMP_C COMP_D COMP_E COMP_F COMP_G COMP_H COMP_I COMP_J COMP_K COMP_L COMP_M COMP_N COMP_O COMP_P COMP_Q COMP_R COMP_S COMP_T COMP_U Unnamed: 54 Unnamed: 55 Unnamed: 56 Unnamed: 57 Unnamed: 58 Unnamed: 59 Unnamed: 60 Unnamed: 61 Unnamed: 62 Unnamed: 63 Unnamed: 64 Unnamed: 65 Unnamed: 66 Unnamed: 67 Unnamed: 68 Unnamed: 69 Unnamed: 70 Unnamed: 71 Unnamed: 72 Unnamed: 73 Unnamed: 74 Unnamed: 75 Unnamed: 76 Unnamed: 77 Unnamed: 78 Unnamed: 79 Unnamed: 80 Unnamed: 81 Unnamed: 82 Unnamed: 83 Unnamed: 84 Unnamed: 85 Unnamed: 86 Unnamed: 87 Unnamed: 88 Unnamed: 89 Unnamed: 90 Unnamed: 91 Unnamed: 92 Unnamed: 93 Unnamed: 94 Unnamed: 95 Unnamed: 96 Unnamed: 97 Unnamed: 98 Unnamed: 99 Unnamed: 100 Unnamed: 101 Unnamed: 102 Unnamed: 103 Unnamed: 104 Unnamed: 105 Unnamed: 106 Unnamed: 107 Unnamed: 108 Unnamed: 109 Unnamed: 110 Unnamed: 111 Unnamed: 112 Unnamed: 113 Unnamed: 114 Unnamed: 115 Unnamed: 116 Unnamed: 117 Unnamed: 118 Unnamed: 119 Unnamed: 120 Unnamed: 121 Unnamed: 122 Unnamed: 123 Unnamed: 124 Unnamed: 125 Unnamed: 126 Unnamed: 127 Unnamed: 128 Unnamed: 129 Unnamed: 130 Unnamed: 131 Unnamed: 132 Unnamed: 133 Unnamed: 134 Unnamed: 135 Unnamed: 136 Unnamed: 137 Unnamed: 138 Unnamed: 139 Unnamed: 140 Unnamed: 141 Unnamed: 142 Unnamed: 143 Unnamed: 144 Unnamed: 145 Unnamed: 146 Unnamed: 147 Unnamed: 148 Unnamed: 149 Unnamed: 150 Unnamed: 151 Unnamed: 152 Unnamed: 153 Unnamed: 154 Unnamed: 155 Unnamed: 156 Unnamed: 157 Unnamed: 158 Unnamed: 159 Unnamed: 160 Unnamed: 161 Unnamed: 162 Unnamed: 163 Unnamed: 164 Unnamed: 165 Unnamed: 166 Unnamed: 167 Unnamed: 168 Unnamed: 169 Unnamed: 170 Unnamed: 171 Unnamed: 172 Unnamed: 173 Unnamed: 174 Unnamed: 175 Unnamed: 176 Unnamed: 177 Unnamed: 178 Unnamed: 179 Unnamed: 180 Unnamed: 181 Unnamed: 182 Unnamed: 183 Unnamed: 184 Unnamed: 185 Unnamed: 186 Unnamed: 187 Unnamed: 188 Unnamed: 189 Unnamed: 190 Unnamed: 191 Unnamed: 192 Unnamed: 193 Unnamed: 194 Unnamed: 195 Unnamed: 196 Unnamed: 197 Unnamed: 198 Unnamed: 199 Unnamed: 200 Unnamed: 201 Unnamed: 202 Unnamed: 203 Unnamed: 204 Unnamed: 205 Unnamed: 206 Unnamed: 207 Unnamed: 208 Unnamed: 209 Unnamed: 210 Unnamed: 211 Unnamed: 212 Unnamed: 213 Unnamed: 214 Unnamed: 215 Unnamed: 216 Unnamed: 217 Unnamed: 218 Unnamed: 219 Unnamed: 220 Unnamed: 221 Unnamed: 222 Unnamed: 223 Unnamed: 224 Unnamed: 225 Unnamed: 226 Unnamed: 227 Unnamed: 228
0 Abadia De Goiás GO 0 6876.0 6876.0 0.0 69.0 318.0 438.0 517.0 3542.0 416.0 1689.0 0.708 0.687 0.830 0.622 -49.440548 -16.758812 893.60 360.0 842.0 147.26 NaN NaN 8583.0 Urbano 145857.60 20554.20 166.41 8053.0 20664.57 284.0 5.0 1.0 56.0 0.0 2.0 29.0 110.0 26.0 4.0 5.0 0.0 2.0 10.0 12.0 4.0 6.0 6.0 1.0 5.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Abadia Dos Dourados MG 0 6704.0 6704.0 0.0 38.0 207.0 260.0 351.0 2709.0 589.0 2207.0 0.690 0.693 0.839 0.563 -47.396832 -18.487565 753.12 77.0 296.0 881.06 Caminhos Do Cerrado D 6972.0 Rural Adjacente 167215.28 12873.50 180.09 7037.0 25591.70 476.0 6.0 6.0 30.0 1.0 2.0 34.0 190.0 70.0 28.0 11.0 0.0 4.0 15.0 29.0 2.0 9.0 14.0 6.0 19.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Abadiânia GO 0 15757.0 15609.0 148.0 139.0 650.0 894.0 1087.0 6896.0 990.0 2202.0 0.690 0.671 0.841 0.579 -48.718812 -16.182672 1017.55 227.0 720.0 1,045.13 Região Turística Do Ouro E Cristais C 19614.0 Rural Adjacente 261161.91 26822.58 287984.49 18427.0 15628.40 288.0 5.0 9.0 26.0 0.0 2.0 7.0 117.0 12.0 57.0 2.0 1.0 0.0 7.0 15.0 3.0 11.0 5.0 1.0 8.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Abaeté MG 0 22690.0 22690.0 0.0 176.0 856.0 1233.0 1539.0 11979.0 2681.0 1994.0 0.698 0.720 0.848 0.556 -45.446191 -19.155848 644.74 1230.0 1716.0 1,817.07 Lago De Três Marias D 23223.0 Urbano 403241.27 26994.09 430235.36 23574.0 18250.42 621.0 18.0 1.0 40.0 0.0 1.0 20.0 303.0 62.0 30.0 9.0 6.0 4.0 28.0 27.0 2.0 15.0 19.0 9.0 27.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Abaetetuba PA 0 141100.0 141040.0 60.0 1354.0 5567.0 7618.0 8905.0 53516.0 5996.0 3530.0 0.628 0.579 0.798 0.537 -48.884404 -1.723470 10.12 3389.0 1218.0 1,610.65 Araguaia-Tocantins D 156292.0 Urbano 1154074.81 95180.48 1249255.29 151934.0 8222.36 931.0 4.0 2.0 43.0 0.0 1.0 27.0 500.0 16.0 31.0 6.0 1.0 1.0 22.0 16.0 2.0 155.0 33.0 15.0 56.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5568 Xique-Xique BA 0 45536.0 45536.0 0.0 518.0 2318.0 3095.0 3620.0 19371.0 3575.0 4533.0 0.585 0.563 0.741 0.479 -42.725508 -10.824974 406.26 169.0 1267.0 5,079.66 Chapada Diamantina D 46440.0 Urbano 328930.58 13330.49 342261.06 48274.0 7089.97 420.0 2.0 1.0 23.0 0.0 1.0 10.0 299.0 6.0 11.0 2.0 4.0 0.0 5.0 9.0 2.0 6.0 19.0 3.0 17.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5569 Zabelê PB 0 2075.0 2075.0 0.0 20.0 110.0 140.0 121.0 874.0 204.0 3639.0 0.623 0.567 0.725 0.587 -37.093552 -8.076874 646.34 2.0 17.0 109.39 Cariri E 2225.0 Rural Adjacente 18946.95 635.10 19582.06 2227.0 8793.02 20.0 0.0 0.0 2.0 0.0 0.0 1.0 8.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 2.0 1.0 0.0 0.0 5.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5570 Zacarias SP 0 2335.0 2335.0 0.0 21.0 104.0 123.0 127.0 1165.0 296.0 1072.0 0.730 0.695 0.826 0.674 -50.055740 -21.050110 415.85 155.0 181.0 319.06 NaN NaN 2684.0 Rural Adjacente 73.95 1988.33 75934.89 2605.0 29149.67 111.0 65.0 0.0 4.0 0.0 0.0 0.0 17.0 4.0 5.0 0.0 0.0 0.0 1.0 1.0 3.0 1.0 1.0 2.0 7.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5571 Zé Doca MA 0 50173.0 50152.0 21.0 541.0 2366.0 3087.0 3505.0 18136.0 3168.0 4272.0 0.595 0.559 0.745 0.505 -45.657698 -3.275481 35.66 6493.0 821.0 2,140.11 NaN NaN 51471.0 Urbano 357.90 22610.35 380509.30 50806.0 7489.46 470.0 3.0 0.0 14.0 0.0 0.0 17.0 334.0 5.0 13.0 4.0 0.0 2.0 6.0 9.0 2.0 11.0 9.0 3.0 38.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5572 Zortéa SC 0 2991.0 2991.0 0.0 31.0 132.0 211.0 190.0 1506.0 256.0 364.0 0.760 0.752 0.885 0.661 -51.549566 -27.450251 685.30 68.0 219.0 190.18 Vale Do Contestado E 3328.0 Intermediário Adjacente 63983.64 3014.65 66998.29 3264.0 20526.44 76.0 0.0 0.0 7.0 2.0 0.0 3.0 45.0 5.0 2.0 0.0 0.0 0.0 1.0 3.0 2.0 1.0 1.0 1.0 3.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5573 rows × 229 columns

Customer

The customer dataset in the Brazilian e-commerce dataset contains information about the locations of each customer. This information includes the zip code, city, and state of each customer.

This information in the dataset can help study customers' shopping habits and preferences. For example, we can use this information to understand the locations of the customers who made purchases on the platform, such as their zip codes, cities, and states. This method can help analyze trends and patterns in the sales data, such as which locations had the most sales or which customers had the most purchases.

In addition, having this information in the dataset can make it easier to perform queries and explorations on the sales data. For example, we could use SQL queries to join the customer dataset with other tables, such as the order table, to find out which customers made the most purchases or to analyze the sales data by customer location.

In [ ]:
customer_data
Out[ ]:
customer_id customer_unique_id customer_zip_code_prefix customer_city customer_state
0 06b8999e2fba1a1fbc88172c00ba8bc7 861eff4711a542e4b93843c6dd7febb0 14409 franca SP
1 18955e83d337fd6b2def6b18a428ac77 290c77bc529b7ac935b93aa66c333dc3 9790 sao bernardo do campo SP
2 4e7b3e00288586ebd08712fdd0374a03 060e732b5b29e8181a18229c7b0b2b5e 1151 sao paulo SP
3 b2b6027bc5c5109e529d4dc6358b12c3 259dac757896d24d7702b9acbbff3f3c 8775 mogi das cruzes SP
4 4f2d8ab171c80ec8364f7c12e35b23ad 345ecd01c38d18a9036ed96c73b8d066 13056 campinas SP
... ... ... ... ... ...
99436 17ddf5dd5d51696bb3d7c6291687be6f 1a29b476fee25c95fbafc67c5ac95cf8 3937 sao paulo SP
99437 e7b71a9017aa05c9a7fd292d714858e8 d52a67c98be1cf6a5c84435bd38d095d 6764 taboao da serra SP
99438 5e28dfe12db7fb50a4b2f691faecea5e e9f50caf99f032f0bf3c55141f019d99 60115 fortaleza CE
99439 56b18e2166679b8a959d72dd06da27f9 73c2643a0a458b49f58cea58833b192e 92120 canoas RS
99440 274fa6071e5e17fe303b9748641082c8 84732c5050c01db9b23e19ba39899398 6703 cotia SP

99441 rows × 5 columns

In [ ]:
customer_data.groupby('customer_city')['customer_id'].count().sort_values(ascending=False).head(10)
Out[ ]:
customer_city
sao paulo                15540
rio de janeiro            6882
belo horizonte            2773
brasilia                  2131
curitiba                  1521
campinas                  1444
porto alegre              1379
salvador                  1245
guarulhos                 1189
sao bernardo do campo      938
Name: customer_id, dtype: int64

Geolocation

The geolocation dataset in the Brazilian Cities dataset contains the latitude and longitude coordinates for each zip code. This information can be used to visualize the geographic distribution of customers, sellers, or sales data.

For example, we wanted to create a map showing the locations of customers who made purchases on the platform. In that case, we could use the latitude and longitude coordinates from the geolocation dataset to plot the locations of these customers on the map. This info could help us understand the geographic distribution of the sales data and provide insights into customers' regional shopping habits and preferences.

Similarly, we wanted to create a map showing the locations of sellers who sold items on the platform. In that case, we could use the latitude and longitude coordinates from the geolocation dataset to plot the locations of these sellers on the map. This info could help us understand the sellers' geographic distribution and provide insights into the regional patterns of sales and commerce on the platform.

Overall, the geolocation dataset can be a valuable tool for creating visualizations and gaining insights into the geographic distribution of customers, sellers, and sales data on the Brazilian e-commerce platform.

In [ ]:
geolocation_data
Out[ ]:
geolocation_zip_code_prefix geolocation_lat geolocation_lng geolocation_city geolocation_state
0 1037 -23.545621 -46.639292 sao paulo SP
1 1046 -23.546081 -46.644820 sao paulo SP
2 1046 -23.546129 -46.642951 sao paulo SP
3 1041 -23.544392 -46.639499 sao paulo SP
4 1035 -23.541578 -46.641607 sao paulo SP
... ... ... ... ... ...
1000158 99950 -28.068639 -52.010705 tapejara RS
1000159 99900 -27.877125 -52.224882 getulio vargas RS
1000160 99950 -28.071855 -52.014716 tapejara RS
1000161 99980 -28.388932 -51.846871 david canabarro RS
1000162 99950 -28.070104 -52.018658 tapejara RS

1000163 rows × 5 columns

Payment

The payment dataset in the Brazilian e-commerce dataset contains information about how each order was paid. This information includes the payment type, payment installments, and payment values for each order.

Having this information in the dataset can help study customers' payment habits. For example, we can use this information to understand the types of payments used on the platform, such as cash, credit card, or bank transfer. This method can help analyze trends and patterns in the sales data, such as which payment types were the most common or which orders had the highest payment values.

In addition, having this information in the dataset can make it easier to perform queries and explorations on the sales data. For example, we could use SQL queries to join the payment dataset with other tables, such as the order table, to find out which orders had the highest payment values or to analyze the sales data by payment type.

In [ ]:
payment_data
Out[ ]:
order_id payment_sequential payment_type payment_installments payment_value
0 b81ef226f3fe1789b1e8b2acac839d17 1 credit_card 8 99.33
1 a9810da82917af2d9aefd1278f1dcfa0 1 credit_card 1 24.39
2 25e8ea4e93396b6fa0d3dd708e76c1bd 1 credit_card 1 65.71
3 ba78997921bbcdc1373bb41e913ab953 1 credit_card 8 107.78
4 42fdf880ba16b47b59251dd489d4441a 1 credit_card 2 128.45
... ... ... ... ... ...
103881 0406037ad97740d563a178ecc7a2075c 1 boleto 1 363.31
103882 7b905861d7c825891d6347454ea7863f 1 credit_card 2 96.80
103883 32609bbb3dd69b3c066a6860554a77bf 1 credit_card 1 47.77
103884 b8b61059626efa996a60be9bb9320e10 1 credit_card 5 369.54
103885 28bbae6599b09d39ca406b747b6632b1 1 boleto 1 191.58

103886 rows × 5 columns

In [ ]:
payment_data.describe()
Out[ ]:
payment_sequential payment_installments payment_value
count 103886.000000 103886.000000 103886.000000
mean 1.092679 2.853349 154.100380
std 0.706584 2.687051 217.494064
min 1.000000 0.000000 0.000000
25% 1.000000 1.000000 56.790000
50% 1.000000 1.000000 100.000000
75% 1.000000 4.000000 171.837500
max 29.000000 24.000000 13664.080000

OrderInfo

The order dataset in the Brazilian e-commerce dataset contains information about each order placed on the platform. This information includes the customer who placed the order, the time when the order was placed, the time when the order was approved, and the time when the order was delivered.

This information in the dataset can help understand the processing time of each order. For example, we can use this information to calculate the time between when an order was placed and when it was delivered, which can give us an idea of how long it takes for orders to be processed and delivered. This info can help analyze trends and patterns in the sales data, such as which orders took the longest to be offered or which customers had the most orders.

In addition, having this information in the dataset can make it easier to perform queries and explorations on the sales data. For example, we could use SQL queries to join the ordered dataset with the customer dataset, allowing us to analyze the sales data by the customer and gain a more detailed and nuanced understanding of the sales data.

In [ ]:
order_data
Out[ ]:
order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date
0 e481f51cbdc54678b7cc49136f2d6af7 9ef432eb6251297304e76186b10a928d delivered 2017-10-02 10:56:33 2017-10-02 11:07:15 2017-10-04 19:55:00 2017-10-10 21:25:13 2017-10-18 00:00:00
1 53cdb2fc8bc7dce0b6741e2150273451 b0830fb4747a6c6d20dea0b8c802d7ef delivered 2018-07-24 20:41:37 2018-07-26 03:24:27 2018-07-26 14:31:00 2018-08-07 15:27:45 2018-08-13 00:00:00
2 47770eb9100c2d0c44946d9cf07ec65d 41ce2a54c0b03bf3443c3d931a367089 delivered 2018-08-08 08:38:49 2018-08-08 08:55:23 2018-08-08 13:50:00 2018-08-17 18:06:29 2018-09-04 00:00:00
3 949d5b44dbf5de918fe9c16f97b45f8a f88197465ea7920adcdbec7375364d82 delivered 2017-11-18 19:28:06 2017-11-18 19:45:59 2017-11-22 13:39:59 2017-12-02 00:28:42 2017-12-15 00:00:00
4 ad21c59c0840e6cb83a9ceb5573f8159 8ab97904e6daea8866dbdbc4fb7aad2c delivered 2018-02-13 21:18:39 2018-02-13 22:20:29 2018-02-14 19:46:34 2018-02-16 18:17:02 2018-02-26 00:00:00
... ... ... ... ... ... ... ... ...
99436 9c5dedf39a927c1b2549525ed64a053c 39bd1228ee8140590ac3aca26f2dfe00 delivered 2017-03-09 09:54:05 2017-03-09 09:54:05 2017-03-10 11:18:03 2017-03-17 15:08:01 2017-03-28 00:00:00
99437 63943bddc261676b46f01ca7ac2f7bd8 1fca14ff2861355f6e5f14306ff977a7 delivered 2018-02-06 12:58:58 2018-02-06 13:10:37 2018-02-07 23:22:42 2018-02-28 17:37:56 2018-03-02 00:00:00
99438 83c1379a015df1e13d02aae0204711ab 1aa71eb042121263aafbe80c1b562c9c delivered 2017-08-27 14:46:43 2017-08-27 15:04:16 2017-08-28 20:52:26 2017-09-21 11:24:17 2017-09-27 00:00:00
99439 11c177c8e97725db2631073c19f07b62 b331b74b18dc79bcdf6532d51e1637c1 delivered 2018-01-08 21:28:27 2018-01-08 21:36:21 2018-01-12 15:35:03 2018-01-25 23:32:54 2018-02-15 00:00:00
99440 66dea50a8b16d9b4dee7af250b4be1a5 edb027a75a1449115f6b43211ae02a24 delivered 2018-03-08 20:57:30 2018-03-09 11:20:28 2018-03-09 22:11:59 2018-03-16 13:08:30 2018-04-03 00:00:00

99441 rows × 8 columns

In [ ]:
order_data.describe()
Out[ ]:
order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date
count 99441 99441 99441 99441 99281 97658 96476 99441
unique 99441 99441 8 98875 90733 81018 95664 459
top a45a7a951625bffc98203d07f77e872e e933c6179317541b4db832474621281f delivered 2018-06-01 13:39:44 2018-02-27 04:31:10 2018-05-09 15:48:00 2018-02-14 21:09:19 2017-12-20 00:00:00
freq 1 1 96478 3 9 47 3 522

Category

The category dataset in the Brazilian e-commerce dataset contains information about the categories of products that were sold on the platform. This information includes the name and description of each product category.

This information in the dataset can help analyze the top-rated product categories. For example, we can use this information to group the products by type and to calculate the number of sales or the average ratings for each category. This information can give us an idea of which products were the most popular or highly rated on the platform.

In addition, having this information in the dataset can make it easier to perform queries and explorations on the sales data. For example, we could use SQL queries to join the category dataset with other tables, such as the order table or the review table, to find out which categories had the most sales or the highest ratings or to analyze the sales data by type.

In [ ]:
category_data
Out[ ]:
product_category_name product_category_name_english
0 beleza_saude health_beauty
1 informatica_acessorios computers_accessories
2 automotivo auto
3 cama_mesa_banho bed_bath_table
4 moveis_decoracao furniture_decor
... ... ...
66 flores flowers
67 artes_e_artesanato arts_and_craftmanship
68 fraldas_higiene diapers_and_hygiene
69 fashion_roupa_infanto_juvenil fashion_childrens_clothes
70 seguros_e_servicos security_and_services

71 rows × 2 columns

Seller

The seller dataset in the Brazilian e-commerce dataset contains information about the locations and unique IDs of each seller on the platform. This information includes each seller's city, state, and an individual ID.

This information in the dataset can help analyze which cities have the most sellers. For example, we can use this information to group the sellers by city and to calculate the number of sellers in each town. This info can give us an idea of which cities have the most sellers on the platform and can provide insights into the regional patterns of sales and commerce.

In addition, having this information in the dataset can make it easier to perform queries and explorations on the sales data. For example, we could use SQL queries to join the seller dataset with other tables, such as the order table or the delivery table, to find out which sellers had the most sales or the fastest delivery times or to analyze the sales data by seller location.

In [ ]:
seller_data
Out[ ]:
seller_id seller_zip_code_prefix seller_city seller_state
0 3442f8959a84dea7ee197c632cb2df15 13023 campinas SP
1 d1b65fc7debc3361ea86b5f14c68d2e2 13844 mogi guacu SP
2 ce3ad9de960102d0677a81f5d0bb7b2d 20031 rio de janeiro RJ
3 c0f3eea2e14555b6faeea3dd58c1b1c3 4195 sao paulo SP
4 51a04a8a6bdcb23deccc82b0b80742cf 12914 braganca paulista SP
... ... ... ... ...
3090 98dddbc4601dd4443ca174359b237166 87111 sarandi PR
3091 f8201cab383e484733266d1906e2fdfa 88137 palhoca SC
3092 74871d19219c7d518d0090283e03c137 4650 sao paulo SP
3093 e603cf3fec55f8697c9059638d6c8eb5 96080 pelotas RS
3094 9e25199f6ef7e7c347120ff175652c3b 12051 taubate SP

3095 rows × 4 columns

In [ ]:
seller_data.groupby('seller_city')['seller_id'].count().sort_values(ascending=False).head(10)
Out[ ]:
seller_city
sao paulo         694
curitiba          127
rio de janeiro     96
belo horizonte     68
ribeirao preto     52
guarulhos          50
ibitinga           49
santo andre        45
campinas           41
maringa            40
Name: seller_id, dtype: int64

Review

The review dataset in the Brazilian e-commerce dataset contains information about the reviews that were left for each product on the platform. This information includes the ratings and comments that customers left for each product, as well as the category of the product.

This information in the dataset can help find the most top-rated products or categories. For example, we can use this information to group the products by type and calculate each category's average rating. This query can give us an idea of which sorts of products were the most highly rated on the platform and can provide insights into the preferences and opinions of customers.

In addition, having this information in the dataset can make it easier to perform queries and explorations on the sales data. For example, we could use SQL queries to join the review dataset with other tables, such as the order table or the category table, to find out which products had the most sales or the highest ratings or to analyze the sales data by product category.

In [ ]:
review_data
Out[ ]:
review_id order_id review_score review_comment_title review_comment_message review_creation_date review_answer_timestamp
0 7bc2406110b926393aa56f80a40eba40 73fc7af87114b39712e6da79b0a377eb 4 NaN NaN 2018-01-18 00:00:00 2018-01-18 21:46:59
1 80e641a11e56f04c1ad469d5645fdfde a548910a1c6147796b98fdf73dbeba33 5 NaN NaN 2018-03-10 00:00:00 2018-03-11 03:05:13
2 228ce5500dc1d8e020d8d1322874b6f0 f9e4b658b201a9f2ecdecbb34bed034b 5 NaN NaN 2018-02-17 00:00:00 2018-02-18 14:36:24
3 e64fb393e7b32834bb789ff8bb30750e 658677c97b385a9be170737859d3511b 5 NaN Recebi bem antes do prazo estipulado. 2017-04-21 00:00:00 2017-04-21 22:02:06
4 f7c4243c7fe1938f181bec41a392bdeb 8e6bfb81e283fa7e4f11123a3fb894f1 5 NaN Parabéns lojas lannister adorei comprar pela I... 2018-03-01 00:00:00 2018-03-02 10:26:53
... ... ... ... ... ... ... ...
99219 574ed12dd733e5fa530cfd4bbf39d7c9 2a8c23fee101d4d5662fa670396eb8da 5 NaN NaN 2018-07-07 00:00:00 2018-07-14 17:18:30
99220 f3897127253a9592a73be9bdfdf4ed7a 22ec9f0669f784db00fa86d035cf8602 5 NaN NaN 2017-12-09 00:00:00 2017-12-11 20:06:42
99221 b3de70c89b1510c4cd3d0649fd302472 55d4004744368f5571d1f590031933e4 5 NaN Excelente mochila, entrega super rápida. Super... 2018-03-22 00:00:00 2018-03-23 09:10:43
99222 1adeb9d84d72fe4e337617733eb85149 7725825d039fc1f0ceb7635e3f7d9206 4 NaN NaN 2018-07-01 00:00:00 2018-07-02 12:59:13
99223 efe49f1d6f951dd88b51e6ccd4cc548f 90531360ecb1eec2a1fbb265a0db0508 1 NaN meu produto chegou e ja tenho que devolver, po... 2017-07-03 00:00:00 2017-07-03 21:01:49

99224 rows × 7 columns

In [ ]:
review_data.describe()
Out[ ]:
review_score
count 99224.000000
mean 4.086421
std 1.347579
min 1.000000
25% 4.000000
50% 5.000000
75% 5.000000
max 5.000000

Data Cleaning

After identifying the tables and datasets available in the Brazilian e-commerce dataset and the Brazilian Cities dataset, the next step in the data cleaning process would be to handle any null or missing values in the data.

One way to do this is to use the dropna() method provided by the pandas' package, which can be used to remove rows that contain null or missing values from a data frame. The dropna() method would remove any rows containing null values from the df data frame, and the resulting data frame would only have rows with non-null values for all columns. Alternatively, we can use the notnull() method to select only the columns with non-null values.

In [ ]:
order_items_data = items_data.dropna()
product_data = product_data.dropna()
city_data = city_data[city_data['CITY'].notnull()]
geolocation_data = geolocation_data.dropna()
customer_data = customer_data.dropna()
payment_data = payment_data.dropna()
review_data = review_data.dropna()
order_data = order_data.dropna()
seller_data = seller_data.dropna()
category_data = category_data.dropna()
In [ ]:
product_data
Out[ ]:
product_id product_category_name product_name_lenght product_description_lenght product_photos_qty product_weight_g product_length_cm product_height_cm product_width_cm
0 1e9e8ef04dbcff4541ed26657ea517e5 perfumaria 40.0 287.0 1.0 225.0 16.0 10.0 14.0
1 3aa071139cb16b67ca9e5dea641aaa2f artes 44.0 276.0 1.0 1000.0 30.0 18.0 20.0
2 96bd76ec8810374ed1b65e291975717f esporte_lazer 46.0 250.0 1.0 154.0 18.0 9.0 15.0
3 cef67bcfe19066a932b7673e239eb23d bebes 27.0 261.0 1.0 371.0 26.0 4.0 26.0
4 9dc1a7de274444849c219cff195d0b71 utilidades_domesticas 37.0 402.0 4.0 625.0 20.0 17.0 13.0
... ... ... ... ... ... ... ... ... ...
32946 a0b7d5a992ccda646f2d34e418fff5a0 moveis_decoracao 45.0 67.0 2.0 12300.0 40.0 40.0 40.0
32947 bf4538d88321d0fd4412a93c974510e6 construcao_ferramentas_iluminacao 41.0 971.0 1.0 1700.0 16.0 19.0 16.0
32948 9a7c6041fa9592d9d9ef6cfe62a71f8c cama_mesa_banho 50.0 799.0 1.0 1400.0 27.0 7.0 27.0
32949 83808703fc0706a22e264b9d75f04a2e informatica_acessorios 60.0 156.0 2.0 700.0 31.0 13.0 20.0
32950 106392145fca363410d287a815be6de4 cama_mesa_banho 58.0 309.0 1.0 2083.0 12.0 2.0 7.0

32340 rows × 9 columns

In [ ]:
city_data
Out[ ]:
CITY STATE CAPITAL IBGE_RES_POP IBGE_RES_POP_BRAS IBGE_RES_POP_ESTR IBGE_1 IBGE_1-4 IBGE_5-9 IBGE_10-14 IBGE_15-59 IBGE_60+ IDHM Ranking 2010 IDHM IDHM_Renda IDHM_Longevidade IDHM_Educacao LONG LAT ALT PAY_TV FIXED_PHONES AREA REGIAO_TUR CATEGORIA_TUR ESTIMATED_POP RURAL_URBAN GVA_TOTAL TAXES GDP POP_GDP GDP_CAPITA COMP_TOT COMP_A COMP_B COMP_C COMP_D COMP_E COMP_F COMP_G COMP_H COMP_I COMP_J COMP_K COMP_L COMP_M COMP_N COMP_O COMP_P COMP_Q COMP_R COMP_S COMP_T COMP_U Unnamed: 54 Unnamed: 55 Unnamed: 56 Unnamed: 57 Unnamed: 58 Unnamed: 59 Unnamed: 60 Unnamed: 61 Unnamed: 62 Unnamed: 63 Unnamed: 64 Unnamed: 65 Unnamed: 66 Unnamed: 67 Unnamed: 68 Unnamed: 69 Unnamed: 70 Unnamed: 71 Unnamed: 72 Unnamed: 73 Unnamed: 74 Unnamed: 75 Unnamed: 76 Unnamed: 77 Unnamed: 78 Unnamed: 79 Unnamed: 80 Unnamed: 81 Unnamed: 82 Unnamed: 83 Unnamed: 84 Unnamed: 85 Unnamed: 86 Unnamed: 87 Unnamed: 88 Unnamed: 89 Unnamed: 90 Unnamed: 91 Unnamed: 92 Unnamed: 93 Unnamed: 94 Unnamed: 95 Unnamed: 96 Unnamed: 97 Unnamed: 98 Unnamed: 99 Unnamed: 100 Unnamed: 101 Unnamed: 102 Unnamed: 103 Unnamed: 104 Unnamed: 105 Unnamed: 106 Unnamed: 107 Unnamed: 108 Unnamed: 109 Unnamed: 110 Unnamed: 111 Unnamed: 112 Unnamed: 113 Unnamed: 114 Unnamed: 115 Unnamed: 116 Unnamed: 117 Unnamed: 118 Unnamed: 119 Unnamed: 120 Unnamed: 121 Unnamed: 122 Unnamed: 123 Unnamed: 124 Unnamed: 125 Unnamed: 126 Unnamed: 127 Unnamed: 128 Unnamed: 129 Unnamed: 130 Unnamed: 131 Unnamed: 132 Unnamed: 133 Unnamed: 134 Unnamed: 135 Unnamed: 136 Unnamed: 137 Unnamed: 138 Unnamed: 139 Unnamed: 140 Unnamed: 141 Unnamed: 142 Unnamed: 143 Unnamed: 144 Unnamed: 145 Unnamed: 146 Unnamed: 147 Unnamed: 148 Unnamed: 149 Unnamed: 150 Unnamed: 151 Unnamed: 152 Unnamed: 153 Unnamed: 154 Unnamed: 155 Unnamed: 156 Unnamed: 157 Unnamed: 158 Unnamed: 159 Unnamed: 160 Unnamed: 161 Unnamed: 162 Unnamed: 163 Unnamed: 164 Unnamed: 165 Unnamed: 166 Unnamed: 167 Unnamed: 168 Unnamed: 169 Unnamed: 170 Unnamed: 171 Unnamed: 172 Unnamed: 173 Unnamed: 174 Unnamed: 175 Unnamed: 176 Unnamed: 177 Unnamed: 178 Unnamed: 179 Unnamed: 180 Unnamed: 181 Unnamed: 182 Unnamed: 183 Unnamed: 184 Unnamed: 185 Unnamed: 186 Unnamed: 187 Unnamed: 188 Unnamed: 189 Unnamed: 190 Unnamed: 191 Unnamed: 192 Unnamed: 193 Unnamed: 194 Unnamed: 195 Unnamed: 196 Unnamed: 197 Unnamed: 198 Unnamed: 199 Unnamed: 200 Unnamed: 201 Unnamed: 202 Unnamed: 203 Unnamed: 204 Unnamed: 205 Unnamed: 206 Unnamed: 207 Unnamed: 208 Unnamed: 209 Unnamed: 210 Unnamed: 211 Unnamed: 212 Unnamed: 213 Unnamed: 214 Unnamed: 215 Unnamed: 216 Unnamed: 217 Unnamed: 218 Unnamed: 219 Unnamed: 220 Unnamed: 221 Unnamed: 222 Unnamed: 223 Unnamed: 224 Unnamed: 225 Unnamed: 226 Unnamed: 227 Unnamed: 228
0 Abadia De Goiás GO 0 6876.0 6876.0 0.0 69.0 318.0 438.0 517.0 3542.0 416.0 1689.0 0.708 0.687 0.830 0.622 -49.440548 -16.758812 893.60 360.0 842.0 147.26 NaN NaN 8583.0 Urbano 145857.60 20554.20 166.41 8053.0 20664.57 284.0 5.0 1.0 56.0 0.0 2.0 29.0 110.0 26.0 4.0 5.0 0.0 2.0 10.0 12.0 4.0 6.0 6.0 1.0 5.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Abadia Dos Dourados MG 0 6704.0 6704.0 0.0 38.0 207.0 260.0 351.0 2709.0 589.0 2207.0 0.690 0.693 0.839 0.563 -47.396832 -18.487565 753.12 77.0 296.0 881.06 Caminhos Do Cerrado D 6972.0 Rural Adjacente 167215.28 12873.50 180.09 7037.0 25591.70 476.0 6.0 6.0 30.0 1.0 2.0 34.0 190.0 70.0 28.0 11.0 0.0 4.0 15.0 29.0 2.0 9.0 14.0 6.0 19.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Abadiânia GO 0 15757.0 15609.0 148.0 139.0 650.0 894.0 1087.0 6896.0 990.0 2202.0 0.690 0.671 0.841 0.579 -48.718812 -16.182672 1017.55 227.0 720.0 1,045.13 Região Turística Do Ouro E Cristais C 19614.0 Rural Adjacente 261161.91 26822.58 287984.49 18427.0 15628.40 288.0 5.0 9.0 26.0 0.0 2.0 7.0 117.0 12.0 57.0 2.0 1.0 0.0 7.0 15.0 3.0 11.0 5.0 1.0 8.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Abaeté MG 0 22690.0 22690.0 0.0 176.0 856.0 1233.0 1539.0 11979.0 2681.0 1994.0 0.698 0.720 0.848 0.556 -45.446191 -19.155848 644.74 1230.0 1716.0 1,817.07 Lago De Três Marias D 23223.0 Urbano 403241.27 26994.09 430235.36 23574.0 18250.42 621.0 18.0 1.0 40.0 0.0 1.0 20.0 303.0 62.0 30.0 9.0 6.0 4.0 28.0 27.0 2.0 15.0 19.0 9.0 27.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Abaetetuba PA 0 141100.0 141040.0 60.0 1354.0 5567.0 7618.0 8905.0 53516.0 5996.0 3530.0 0.628 0.579 0.798 0.537 -48.884404 -1.723470 10.12 3389.0 1218.0 1,610.65 Araguaia-Tocantins D 156292.0 Urbano 1154074.81 95180.48 1249255.29 151934.0 8222.36 931.0 4.0 2.0 43.0 0.0 1.0 27.0 500.0 16.0 31.0 6.0 1.0 1.0 22.0 16.0 2.0 155.0 33.0 15.0 56.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5568 Xique-Xique BA 0 45536.0 45536.0 0.0 518.0 2318.0 3095.0 3620.0 19371.0 3575.0 4533.0 0.585 0.563 0.741 0.479 -42.725508 -10.824974 406.26 169.0 1267.0 5,079.66 Chapada Diamantina D 46440.0 Urbano 328930.58 13330.49 342261.06 48274.0 7089.97 420.0 2.0 1.0 23.0 0.0 1.0 10.0 299.0 6.0 11.0 2.0 4.0 0.0 5.0 9.0 2.0 6.0 19.0 3.0 17.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5569 Zabelê PB 0 2075.0 2075.0 0.0 20.0 110.0 140.0 121.0 874.0 204.0 3639.0 0.623 0.567 0.725 0.587 -37.093552 -8.076874 646.34 2.0 17.0 109.39 Cariri E 2225.0 Rural Adjacente 18946.95 635.10 19582.06 2227.0 8793.02 20.0 0.0 0.0 2.0 0.0 0.0 1.0 8.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 2.0 1.0 0.0 0.0 5.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5570 Zacarias SP 0 2335.0 2335.0 0.0 21.0 104.0 123.0 127.0 1165.0 296.0 1072.0 0.730 0.695 0.826 0.674 -50.055740 -21.050110 415.85 155.0 181.0 319.06 NaN NaN 2684.0 Rural Adjacente 73.95 1988.33 75934.89 2605.0 29149.67 111.0 65.0 0.0 4.0 0.0 0.0 0.0 17.0 4.0 5.0 0.0 0.0 0.0 1.0 1.0 3.0 1.0 1.0 2.0 7.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5571 Zé Doca MA 0 50173.0 50152.0 21.0 541.0 2366.0 3087.0 3505.0 18136.0 3168.0 4272.0 0.595 0.559 0.745 0.505 -45.657698 -3.275481 35.66 6493.0 821.0 2,140.11 NaN NaN 51471.0 Urbano 357.90 22610.35 380509.30 50806.0 7489.46 470.0 3.0 0.0 14.0 0.0 0.0 17.0 334.0 5.0 13.0 4.0 0.0 2.0 6.0 9.0 2.0 11.0 9.0 3.0 38.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5572 Zortéa SC 0 2991.0 2991.0 0.0 31.0 132.0 211.0 190.0 1506.0 256.0 364.0 0.760 0.752 0.885 0.661 -51.549566 -27.450251 685.30 68.0 219.0 190.18 Vale Do Contestado E 3328.0 Intermediário Adjacente 63983.64 3014.65 66998.29 3264.0 20526.44 76.0 0.0 0.0 7.0 2.0 0.0 3.0 45.0 5.0 2.0 0.0 0.0 0.0 1.0 3.0 2.0 1.0 1.0 1.0 3.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5573 rows × 229 columns

Cleaning Order Data

In the orders dataset of the Brazilian e-commerce dataset, we may have information about the delivery, process, approval, and purchase time and date for each order. To make it easier to perform queries and analyses on this data, we can use the DatetimeIndex function provided by the pandas' package to normalize the purchasing, processing, and delivery times in the order data.

To do this, we would first need to load the order data into a pandas data frame and then use the DatetimeIndex function to create a DateTime index for the data frame.

The DatetimeIndex function would take the delivered_date column from the data frame and create a DateTime index for the data frame based on the values in this column. This function would allow us to perform queries and analyses on the order data using the DateTime index, making it easier to handle and manipulate.

For example, once we have created the DateTime index, we can select only the rows in the data frame that fall within a specific date range. We can also use it to group the data by date and to perform calculations such as the total number of orders or the full value of orders within a given date range. These operations can help analyze trends and patterns in the sales data and support more advanced analyses and insights.

In [ ]:
order_data['order_purchase_year'] = pd.DatetimeIndex(order_data['order_purchase_timestamp']).year
order_data['order_purchase_month'] = pd.DatetimeIndex(order_data['order_purchase_timestamp']).month
order_data['order_purchase_day'] = pd.DatetimeIndex(order_data['order_purchase_timestamp']).day
order_data['order_approve_year'] = pd.DatetimeIndex(order_data['order_approved_at']).year
order_data['order_approve_month'] = pd.DatetimeIndex(order_data['order_approved_at']).month
order_data['order_approve_day'] = pd.DatetimeIndex(order_data['order_approved_at']).day
order_data['order_deliver_carrier_year'] = pd.DatetimeIndex(order_data['order_delivered_carrier_date']).year
order_data['order_deliver_carrier_month'] = pd.DatetimeIndex(order_data['order_delivered_carrier_date']).month
order_data['order_deliver_carrier_day'] = pd.DatetimeIndex(order_data['order_delivered_carrier_date']).day
order_data['order_deliver_customer_year'] = pd.DatetimeIndex(order_data['order_delivered_customer_date']).year
order_data['order_deliver_customer_month'] = pd.DatetimeIndex(order_data['order_delivered_customer_date']).month
order_data['order_deliver_customer_day'] = pd.DatetimeIndex(order_data['order_delivered_customer_date']).day
order_data['order_estimate_delivery_year'] = pd.DatetimeIndex(order_data['order_estimated_delivery_date']).year
order_data['order_estimate_delivery_month'] = pd.DatetimeIndex(order_data['order_estimated_delivery_date']).month
order_data['order_estimate_delivery_day'] = pd.DatetimeIndex(order_data['order_estimated_delivery_date']).day
<ipython-input-31-587b960f43c8>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  order_data['order_purchase_year'] = pd.DatetimeIndex(order_data['order_purchase_timestamp']).year
<ipython-input-31-587b960f43c8>:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  order_data['order_purchase_month'] = pd.DatetimeIndex(order_data['order_purchase_timestamp']).month
<ipython-input-31-587b960f43c8>:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  order_data['order_purchase_day'] = pd.DatetimeIndex(order_data['order_purchase_timestamp']).day
<ipython-input-31-587b960f43c8>:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  order_data['order_approve_year'] = pd.DatetimeIndex(order_data['order_approved_at']).year
<ipython-input-31-587b960f43c8>:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  order_data['order_approve_month'] = pd.DatetimeIndex(order_data['order_approved_at']).month
<ipython-input-31-587b960f43c8>:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  order_data['order_approve_day'] = pd.DatetimeIndex(order_data['order_approved_at']).day
<ipython-input-31-587b960f43c8>:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  order_data['order_deliver_carrier_year'] = pd.DatetimeIndex(order_data['order_delivered_carrier_date']).year
<ipython-input-31-587b960f43c8>:8: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  order_data['order_deliver_carrier_month'] = pd.DatetimeIndex(order_data['order_delivered_carrier_date']).month
<ipython-input-31-587b960f43c8>:9: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  order_data['order_deliver_carrier_day'] = pd.DatetimeIndex(order_data['order_delivered_carrier_date']).day
<ipython-input-31-587b960f43c8>:10: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  order_data['order_deliver_customer_year'] = pd.DatetimeIndex(order_data['order_delivered_customer_date']).year
<ipython-input-31-587b960f43c8>:11: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  order_data['order_deliver_customer_month'] = pd.DatetimeIndex(order_data['order_delivered_customer_date']).month
<ipython-input-31-587b960f43c8>:12: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  order_data['order_deliver_customer_day'] = pd.DatetimeIndex(order_data['order_delivered_customer_date']).day
<ipython-input-31-587b960f43c8>:13: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  order_data['order_estimate_delivery_year'] = pd.DatetimeIndex(order_data['order_estimated_delivery_date']).year
<ipython-input-31-587b960f43c8>:14: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  order_data['order_estimate_delivery_month'] = pd.DatetimeIndex(order_data['order_estimated_delivery_date']).month
<ipython-input-31-587b960f43c8>:15: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  order_data['order_estimate_delivery_day'] = pd.DatetimeIndex(order_data['order_estimated_delivery_date']).day

Cleaning Product Data

Some of the column names may be misspelled. Suppose some column names in the Brazilian e-commerce dataset are misspelled or incorrect. In that case, we can use the rename() function provided by the pandas' package to correct the names of the columns. Having correctly spelled and adequately formatted column names can be helpful in future merge and join operations on the data.

In [ ]:
product_data = product_data.rename(columns={'product_name_lenght':'product_name_length'})
product_data = product_data.rename(columns={'product_description_lenght':'product_description_length'})

Cleaning City Data

The City data in the Brazilian Cities dataset contains many columns, and we shall identify the ones that will be useful for our studies. For instance, the location, area, gdp, taxes, age groups of people, number of companies, and number of stores like Walmart will be helpful for our studies so that we can select and rename these proper columns for sales analytics.

We review the columns in the dataset and select those relevant to our research. It is also helpful to rename the columns chosen to indicate their purpose or content more clearly. This method will make it easier to refer to the columns in our analysis and improve the clarity and interpretability of our results.

Generally, it is best to consider factors such as the amount of data available for each column and the potential impact of missing or incomplete data on the results. This method will help ensure that our analysis is as comprehensive and accurate as possible. For instance, the number of Walmart stores may affect a city's buying or selling habits.

To replace commas with spaces in the column names of a dataset, we use the .replace() method in a Python program. This method allows us to specify the character or characters we want to replace and the replacement character or string. In addition to replacing commas with spaces, we use the .lower() method to convert the column names to lowercase. The reason is that we want to make the column names more consistent or easier to read.

Once we have replaced the commas and converted the column names to lowercase, we save the modified dataset to a new file or continue our analysis.

In [ ]:
city_data
Out[ ]:
CITY STATE CAPITAL IBGE_RES_POP IBGE_RES_POP_BRAS IBGE_RES_POP_ESTR IBGE_1 IBGE_1-4 IBGE_5-9 IBGE_10-14 IBGE_15-59 IBGE_60+ IDHM Ranking 2010 IDHM IDHM_Renda IDHM_Longevidade IDHM_Educacao LONG LAT ALT PAY_TV FIXED_PHONES AREA REGIAO_TUR CATEGORIA_TUR ESTIMATED_POP RURAL_URBAN GVA_TOTAL TAXES GDP POP_GDP GDP_CAPITA COMP_TOT COMP_A COMP_B COMP_C COMP_D COMP_E COMP_F COMP_G COMP_H COMP_I COMP_J COMP_K COMP_L COMP_M COMP_N COMP_O COMP_P COMP_Q COMP_R COMP_S COMP_T COMP_U Unnamed: 54 Unnamed: 55 Unnamed: 56 Unnamed: 57 Unnamed: 58 Unnamed: 59 Unnamed: 60 Unnamed: 61 Unnamed: 62 Unnamed: 63 Unnamed: 64 Unnamed: 65 Unnamed: 66 Unnamed: 67 Unnamed: 68 Unnamed: 69 Unnamed: 70 Unnamed: 71 Unnamed: 72 Unnamed: 73 Unnamed: 74 Unnamed: 75 Unnamed: 76 Unnamed: 77 Unnamed: 78 Unnamed: 79 Unnamed: 80 Unnamed: 81 Unnamed: 82 Unnamed: 83 Unnamed: 84 Unnamed: 85 Unnamed: 86 Unnamed: 87 Unnamed: 88 Unnamed: 89 Unnamed: 90 Unnamed: 91 Unnamed: 92 Unnamed: 93 Unnamed: 94 Unnamed: 95 Unnamed: 96 Unnamed: 97 Unnamed: 98 Unnamed: 99 Unnamed: 100 Unnamed: 101 Unnamed: 102 Unnamed: 103 Unnamed: 104 Unnamed: 105 Unnamed: 106 Unnamed: 107 Unnamed: 108 Unnamed: 109 Unnamed: 110 Unnamed: 111 Unnamed: 112 Unnamed: 113 Unnamed: 114 Unnamed: 115 Unnamed: 116 Unnamed: 117 Unnamed: 118 Unnamed: 119 Unnamed: 120 Unnamed: 121 Unnamed: 122 Unnamed: 123 Unnamed: 124 Unnamed: 125 Unnamed: 126 Unnamed: 127 Unnamed: 128 Unnamed: 129 Unnamed: 130 Unnamed: 131 Unnamed: 132 Unnamed: 133 Unnamed: 134 Unnamed: 135 Unnamed: 136 Unnamed: 137 Unnamed: 138 Unnamed: 139 Unnamed: 140 Unnamed: 141 Unnamed: 142 Unnamed: 143 Unnamed: 144 Unnamed: 145 Unnamed: 146 Unnamed: 147 Unnamed: 148 Unnamed: 149 Unnamed: 150 Unnamed: 151 Unnamed: 152 Unnamed: 153 Unnamed: 154 Unnamed: 155 Unnamed: 156 Unnamed: 157 Unnamed: 158 Unnamed: 159 Unnamed: 160 Unnamed: 161 Unnamed: 162 Unnamed: 163 Unnamed: 164 Unnamed: 165 Unnamed: 166 Unnamed: 167 Unnamed: 168 Unnamed: 169 Unnamed: 170 Unnamed: 171 Unnamed: 172 Unnamed: 173 Unnamed: 174 Unnamed: 175 Unnamed: 176 Unnamed: 177 Unnamed: 178 Unnamed: 179 Unnamed: 180 Unnamed: 181 Unnamed: 182 Unnamed: 183 Unnamed: 184 Unnamed: 185 Unnamed: 186 Unnamed: 187 Unnamed: 188 Unnamed: 189 Unnamed: 190 Unnamed: 191 Unnamed: 192 Unnamed: 193 Unnamed: 194 Unnamed: 195 Unnamed: 196 Unnamed: 197 Unnamed: 198 Unnamed: 199 Unnamed: 200 Unnamed: 201 Unnamed: 202 Unnamed: 203 Unnamed: 204 Unnamed: 205 Unnamed: 206 Unnamed: 207 Unnamed: 208 Unnamed: 209 Unnamed: 210 Unnamed: 211 Unnamed: 212 Unnamed: 213 Unnamed: 214 Unnamed: 215 Unnamed: 216 Unnamed: 217 Unnamed: 218 Unnamed: 219 Unnamed: 220 Unnamed: 221 Unnamed: 222 Unnamed: 223 Unnamed: 224 Unnamed: 225 Unnamed: 226 Unnamed: 227 Unnamed: 228
0 Abadia De Goiás GO 0 6876.0 6876.0 0.0 69.0 318.0 438.0 517.0 3542.0 416.0 1689.0 0.708 0.687 0.830 0.622 -49.440548 -16.758812 893.60 360.0 842.0 147.26 NaN NaN 8583.0 Urbano 145857.60 20554.20 166.41 8053.0 20664.57 284.0 5.0 1.0 56.0 0.0 2.0 29.0 110.0 26.0 4.0 5.0 0.0 2.0 10.0 12.0 4.0 6.0 6.0 1.0 5.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Abadia Dos Dourados MG 0 6704.0 6704.0 0.0 38.0 207.0 260.0 351.0 2709.0 589.0 2207.0 0.690 0.693 0.839 0.563 -47.396832 -18.487565 753.12 77.0 296.0 881.06 Caminhos Do Cerrado D 6972.0 Rural Adjacente 167215.28 12873.50 180.09 7037.0 25591.70 476.0 6.0 6.0 30.0 1.0 2.0 34.0 190.0 70.0 28.0 11.0 0.0 4.0 15.0 29.0 2.0 9.0 14.0 6.0 19.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Abadiânia GO 0 15757.0 15609.0 148.0 139.0 650.0 894.0 1087.0 6896.0 990.0 2202.0 0.690 0.671 0.841 0.579 -48.718812 -16.182672 1017.55 227.0 720.0 1,045.13 Região Turística Do Ouro E Cristais C 19614.0 Rural Adjacente 261161.91 26822.58 287984.49 18427.0 15628.40 288.0 5.0 9.0 26.0 0.0 2.0 7.0 117.0 12.0 57.0 2.0 1.0 0.0 7.0 15.0 3.0 11.0 5.0 1.0 8.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Abaeté MG 0 22690.0 22690.0 0.0 176.0 856.0 1233.0 1539.0 11979.0 2681.0 1994.0 0.698 0.720 0.848 0.556 -45.446191 -19.155848 644.74 1230.0 1716.0 1,817.07 Lago De Três Marias D 23223.0 Urbano 403241.27 26994.09 430235.36 23574.0 18250.42 621.0 18.0 1.0 40.0 0.0 1.0 20.0 303.0 62.0 30.0 9.0 6.0 4.0 28.0 27.0 2.0 15.0 19.0 9.0 27.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Abaetetuba PA 0 141100.0 141040.0 60.0 1354.0 5567.0 7618.0 8905.0 53516.0 5996.0 3530.0 0.628 0.579 0.798 0.537 -48.884404 -1.723470 10.12 3389.0 1218.0 1,610.65 Araguaia-Tocantins D 156292.0 Urbano 1154074.81 95180.48 1249255.29 151934.0 8222.36 931.0 4.0 2.0 43.0 0.0 1.0 27.0 500.0 16.0 31.0 6.0 1.0 1.0 22.0 16.0 2.0 155.0 33.0 15.0 56.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5568 Xique-Xique BA 0 45536.0 45536.0 0.0 518.0 2318.0 3095.0 3620.0 19371.0 3575.0 4533.0 0.585 0.563 0.741 0.479 -42.725508 -10.824974 406.26 169.0 1267.0 5,079.66 Chapada Diamantina D 46440.0 Urbano 328930.58 13330.49 342261.06 48274.0 7089.97 420.0 2.0 1.0 23.0 0.0 1.0 10.0 299.0 6.0 11.0 2.0 4.0 0.0 5.0 9.0 2.0 6.0 19.0 3.0 17.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5569 Zabelê PB 0 2075.0 2075.0 0.0 20.0 110.0 140.0 121.0 874.0 204.0 3639.0 0.623 0.567 0.725 0.587 -37.093552 -8.076874 646.34 2.0 17.0 109.39 Cariri E 2225.0 Rural Adjacente 18946.95 635.10 19582.06 2227.0 8793.02 20.0 0.0 0.0 2.0 0.0 0.0 1.0 8.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 2.0 1.0 0.0 0.0 5.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5570 Zacarias SP 0 2335.0 2335.0 0.0 21.0 104.0 123.0 127.0 1165.0 296.0 1072.0 0.730 0.695 0.826 0.674 -50.055740 -21.050110 415.85 155.0 181.0 319.06 NaN NaN 2684.0 Rural Adjacente 73.95 1988.33 75934.89 2605.0 29149.67 111.0 65.0 0.0 4.0 0.0 0.0 0.0 17.0 4.0 5.0 0.0 0.0 0.0 1.0 1.0 3.0 1.0 1.0 2.0 7.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5571 Zé Doca MA 0 50173.0 50152.0 21.0 541.0 2366.0 3087.0 3505.0 18136.0 3168.0 4272.0 0.595 0.559 0.745 0.505 -45.657698 -3.275481 35.66 6493.0 821.0 2,140.11 NaN NaN 51471.0 Urbano 357.90 22610.35 380509.30 50806.0 7489.46 470.0 3.0 0.0 14.0 0.0 0.0 17.0 334.0 5.0 13.0 4.0 0.0 2.0 6.0 9.0 2.0 11.0 9.0 3.0 38.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5572 Zortéa SC 0 2991.0 2991.0 0.0 31.0 132.0 211.0 190.0 1506.0 256.0 364.0 0.760 0.752 0.885 0.661 -51.549566 -27.450251 685.30 68.0 219.0 190.18 Vale Do Contestado E 3328.0 Intermediário Adjacente 63983.64 3014.65 66998.29 3264.0 20526.44 76.0 0.0 0.0 7.0 2.0 0.0 3.0 45.0 5.0 2.0 0.0 0.0 0.0 1.0 3.0 2.0 1.0 1.0 1.0 3.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5573 rows × 229 columns

In [ ]:
city_data = city_data[['CITY', 'STATE','LONG','LAT','ALT','AREA','TAXES','GDP','IBGE_RES_POP','IBGE_1','IBGE_1-4','IBGE_5-9','IBGE_10-14','IBGE_15-59','IBGE_60+', 'IDHM','PAY_TV','COMP_TOT','COMP_A','COMP_B','COMP_C','COMP_D','COMP_E','COMP_F','COMP_G','COMP_H','COMP_I','COMP_J','COMP_K','COMP_L','COMP_M','COMP_N','COMP_O','COMP_P','COMP_Q','COMP_R','COMP_S','COMP_T','COMP_U'
#,'WAL-MART'
]]
In [ ]:
city_data = city_data.rename(columns={'CITY':'city'})
city_data = city_data.rename(columns={'STATE':'state'})
city_data = city_data.rename(columns={'LONG':'longitude'})
city_data = city_data.rename(columns={'LAT':'latitude'})
city_data = city_data.rename(columns={'ALT':'altitude'})
city_data = city_data.rename(columns={'AREA':'area'})
city_data = city_data.rename(columns={'TAXES':'taxes'})
city_data = city_data.rename(columns={'GDP':'gdp'})
city_data = city_data.rename(columns={'IBGE_RES_POP':'res_population'})
city_data = city_data.rename(columns={'IBGE_1':'pop_below1'})
city_data = city_data.rename(columns={'IBGE_1-4':'pop_1_4'})
city_data = city_data.rename(columns={'IBGE_5-9':'pop_5_9'})
city_data = city_data.rename(columns={'IBGE_10-14':'pop_10_14'})
city_data = city_data.rename(columns={'IBGE_15-59':'pop_15_59'})
city_data = city_data.rename(columns={'IBGE_60+':'pop_60'})
city_data = city_data.rename(columns={'IDHM':'HDI'})
city_data = city_data.rename(columns={'PAY_TV':'pay_TV'})
city_data = city_data.rename(columns={'COM_TOT':'total_company'})
city_data = city_data.rename(columns={'WAL-MART':'walmart'})
In [ ]:
city_data['area'] = city_data['area'].str.replace(',', '').astype(float)
In [ ]:
city_data
Out[ ]:
city state longitude latitude altitude area taxes gdp res_population pop_below1 pop_1_4 pop_5_9 pop_10_14 pop_15_59 pop_60 HDI pay_TV COMP_TOT COMP_A COMP_B COMP_C COMP_D COMP_E COMP_F COMP_G COMP_H COMP_I COMP_J COMP_K COMP_L COMP_M COMP_N COMP_O COMP_P COMP_Q COMP_R COMP_S COMP_T COMP_U
0 Abadia De Goiás GO -49.440548 -16.758812 893.60 147.26 20554.20 166.41 6876.0 69.0 318.0 438.0 517.0 3542.0 416.0 0.708 360.0 284.0 5.0 1.0 56.0 0.0 2.0 29.0 110.0 26.0 4.0 5.0 0.0 2.0 10.0 12.0 4.0 6.0 6.0 1.0 5.0 0.0 0.0
1 Abadia Dos Dourados MG -47.396832 -18.487565 753.12 881.06 12873.50 180.09 6704.0 38.0 207.0 260.0 351.0 2709.0 589.0 0.690 77.0 476.0 6.0 6.0 30.0 1.0 2.0 34.0 190.0 70.0 28.0 11.0 0.0 4.0 15.0 29.0 2.0 9.0 14.0 6.0 19.0 0.0 0.0
2 Abadiânia GO -48.718812 -16.182672 1017.55 1045.13 26822.58 287984.49 15757.0 139.0 650.0 894.0 1087.0 6896.0 990.0 0.690 227.0 288.0 5.0 9.0 26.0 0.0 2.0 7.0 117.0 12.0 57.0 2.0 1.0 0.0 7.0 15.0 3.0 11.0 5.0 1.0 8.0 0.0 0.0
3 Abaeté MG -45.446191 -19.155848 644.74 1817.07 26994.09 430235.36 22690.0 176.0 856.0 1233.0 1539.0 11979.0 2681.0 0.698 1230.0 621.0 18.0 1.0 40.0 0.0 1.0 20.0 303.0 62.0 30.0 9.0 6.0 4.0 28.0 27.0 2.0 15.0 19.0 9.0 27.0 0.0 0.0
4 Abaetetuba PA -48.884404 -1.723470 10.12 1610.65 95180.48 1249255.29 141100.0 1354.0 5567.0 7618.0 8905.0 53516.0 5996.0 0.628 3389.0 931.0 4.0 2.0 43.0 0.0 1.0 27.0 500.0 16.0 31.0 6.0 1.0 1.0 22.0 16.0 2.0 155.0 33.0 15.0 56.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5568 Xique-Xique BA -42.725508 -10.824974 406.26 5079.66 13330.49 342261.06 45536.0 518.0 2318.0 3095.0 3620.0 19371.0 3575.0 0.585 169.0 420.0 2.0 1.0 23.0 0.0 1.0 10.0 299.0 6.0 11.0 2.0 4.0 0.0 5.0 9.0 2.0 6.0 19.0 3.0 17.0 0.0 0.0
5569 Zabelê PB -37.093552 -8.076874 646.34 109.39 635.10 19582.06 2075.0 20.0 110.0 140.0 121.0 874.0 204.0 0.623 2.0 20.0 0.0 0.0 2.0 0.0 0.0 1.0 8.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 2.0 1.0 0.0 0.0 5.0 0.0 0.0
5570 Zacarias SP -50.055740 -21.050110 415.85 319.06 1988.33 75934.89 2335.0 21.0 104.0 123.0 127.0 1165.0 296.0 0.730 155.0 111.0 65.0 0.0 4.0 0.0 0.0 0.0 17.0 4.0 5.0 0.0 0.0 0.0 1.0 1.0 3.0 1.0 1.0 2.0 7.0 0.0 0.0
5571 Zé Doca MA -45.657698 -3.275481 35.66 2140.11 22610.35 380509.30 50173.0 541.0 2366.0 3087.0 3505.0 18136.0 3168.0 0.595 6493.0 470.0 3.0 0.0 14.0 0.0 0.0 17.0 334.0 5.0 13.0 4.0 0.0 2.0 6.0 9.0 2.0 11.0 9.0 3.0 38.0 0.0 0.0
5572 Zortéa SC -51.549566 -27.450251 685.30 190.18 3014.65 66998.29 2991.0 31.0 132.0 211.0 190.0 1506.0 256.0 0.760 68.0 76.0 0.0 0.0 7.0 2.0 0.0 3.0 45.0 5.0 2.0 0.0 0.0 0.0 1.0 3.0 2.0 1.0 1.0 1.0 3.0 0.0 0.0

5573 rows × 39 columns

In [ ]:
city_data
Out[ ]:
city state longitude latitude altitude area taxes gdp res_population pop_below1 pop_1_4 pop_5_9 pop_10_14 pop_15_59 pop_60 HDI pay_TV COMP_TOT COMP_A COMP_B COMP_C COMP_D COMP_E COMP_F COMP_G COMP_H COMP_I COMP_J COMP_K COMP_L COMP_M COMP_N COMP_O COMP_P COMP_Q COMP_R COMP_S COMP_T COMP_U
0 Abadia De Goiás GO -49.440548 -16.758812 893.60 147.26 20554.20 166.41 6876.0 69.0 318.0 438.0 517.0 3542.0 416.0 0.708 360.0 284.0 5.0 1.0 56.0 0.0 2.0 29.0 110.0 26.0 4.0 5.0 0.0 2.0 10.0 12.0 4.0 6.0 6.0 1.0 5.0 0.0 0.0
1 Abadia Dos Dourados MG -47.396832 -18.487565 753.12 881.06 12873.50 180.09 6704.0 38.0 207.0 260.0 351.0 2709.0 589.0 0.690 77.0 476.0 6.0 6.0 30.0 1.0 2.0 34.0 190.0 70.0 28.0 11.0 0.0 4.0 15.0 29.0 2.0 9.0 14.0 6.0 19.0 0.0 0.0
2 Abadiânia GO -48.718812 -16.182672 1017.55 1045.13 26822.58 287984.49 15757.0 139.0 650.0 894.0 1087.0 6896.0 990.0 0.690 227.0 288.0 5.0 9.0 26.0 0.0 2.0 7.0 117.0 12.0 57.0 2.0 1.0 0.0 7.0 15.0 3.0 11.0 5.0 1.0 8.0 0.0 0.0
3 Abaeté MG -45.446191 -19.155848 644.74 1817.07 26994.09 430235.36 22690.0 176.0 856.0 1233.0 1539.0 11979.0 2681.0 0.698 1230.0 621.0 18.0 1.0 40.0 0.0 1.0 20.0 303.0 62.0 30.0 9.0 6.0 4.0 28.0 27.0 2.0 15.0 19.0 9.0 27.0 0.0 0.0
4 Abaetetuba PA -48.884404 -1.723470 10.12 1610.65 95180.48 1249255.29 141100.0 1354.0 5567.0 7618.0 8905.0 53516.0 5996.0 0.628 3389.0 931.0 4.0 2.0 43.0 0.0 1.0 27.0 500.0 16.0 31.0 6.0 1.0 1.0 22.0 16.0 2.0 155.0 33.0 15.0 56.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5568 Xique-Xique BA -42.725508 -10.824974 406.26 5079.66 13330.49 342261.06 45536.0 518.0 2318.0 3095.0 3620.0 19371.0 3575.0 0.585 169.0 420.0 2.0 1.0 23.0 0.0 1.0 10.0 299.0 6.0 11.0 2.0 4.0 0.0 5.0 9.0 2.0 6.0 19.0 3.0 17.0 0.0 0.0
5569 Zabelê PB -37.093552 -8.076874 646.34 109.39 635.10 19582.06 2075.0 20.0 110.0 140.0 121.0 874.0 204.0 0.623 2.0 20.0 0.0 0.0 2.0 0.0 0.0 1.0 8.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 2.0 1.0 0.0 0.0 5.0 0.0 0.0
5570 Zacarias SP -50.055740 -21.050110 415.85 319.06 1988.33 75934.89 2335.0 21.0 104.0 123.0 127.0 1165.0 296.0 0.730 155.0 111.0 65.0 0.0 4.0 0.0 0.0 0.0 17.0 4.0 5.0 0.0 0.0 0.0 1.0 1.0 3.0 1.0 1.0 2.0 7.0 0.0 0.0
5571 Zé Doca MA -45.657698 -3.275481 35.66 2140.11 22610.35 380509.30 50173.0 541.0 2366.0 3087.0 3505.0 18136.0 3168.0 0.595 6493.0 470.0 3.0 0.0 14.0 0.0 0.0 17.0 334.0 5.0 13.0 4.0 0.0 2.0 6.0 9.0 2.0 11.0 9.0 3.0 38.0 0.0 0.0
5572 Zortéa SC -51.549566 -27.450251 685.30 190.18 3014.65 66998.29 2991.0 31.0 132.0 211.0 190.0 1506.0 256.0 0.760 68.0 76.0 0.0 0.0 7.0 2.0 0.0 3.0 45.0 5.0 2.0 0.0 0.0 0.0 1.0 3.0 2.0 1.0 1.0 1.0 3.0 0.0 0.0

5573 rows × 39 columns

In [ ]:
city_data['city'] = city_data['city'].str.lower()
In [ ]:
city_data
Out[ ]:
city state longitude latitude altitude area taxes gdp res_population pop_below1 pop_1_4 pop_5_9 pop_10_14 pop_15_59 pop_60 HDI pay_TV COMP_TOT COMP_A COMP_B COMP_C COMP_D COMP_E COMP_F COMP_G COMP_H COMP_I COMP_J COMP_K COMP_L COMP_M COMP_N COMP_O COMP_P COMP_Q COMP_R COMP_S COMP_T COMP_U
0 abadia de goiás GO -49.440548 -16.758812 893.60 147.26 20554.20 166.41 6876.0 69.0 318.0 438.0 517.0 3542.0 416.0 0.708 360.0 284.0 5.0 1.0 56.0 0.0 2.0 29.0 110.0 26.0 4.0 5.0 0.0 2.0 10.0 12.0 4.0 6.0 6.0 1.0 5.0 0.0 0.0
1 abadia dos dourados MG -47.396832 -18.487565 753.12 881.06 12873.50 180.09 6704.0 38.0 207.0 260.0 351.0 2709.0 589.0 0.690 77.0 476.0 6.0 6.0 30.0 1.0 2.0 34.0 190.0 70.0 28.0 11.0 0.0 4.0 15.0 29.0 2.0 9.0 14.0 6.0 19.0 0.0 0.0
2 abadiânia GO -48.718812 -16.182672 1017.55 1045.13 26822.58 287984.49 15757.0 139.0 650.0 894.0 1087.0 6896.0 990.0 0.690 227.0 288.0 5.0 9.0 26.0 0.0 2.0 7.0 117.0 12.0 57.0 2.0 1.0 0.0 7.0 15.0 3.0 11.0 5.0 1.0 8.0 0.0 0.0
3 abaeté MG -45.446191 -19.155848 644.74 1817.07 26994.09 430235.36 22690.0 176.0 856.0 1233.0 1539.0 11979.0 2681.0 0.698 1230.0 621.0 18.0 1.0 40.0 0.0 1.0 20.0 303.0 62.0 30.0 9.0 6.0 4.0 28.0 27.0 2.0 15.0 19.0 9.0 27.0 0.0 0.0
4 abaetetuba PA -48.884404 -1.723470 10.12 1610.65 95180.48 1249255.29 141100.0 1354.0 5567.0 7618.0 8905.0 53516.0 5996.0 0.628 3389.0 931.0 4.0 2.0 43.0 0.0 1.0 27.0 500.0 16.0 31.0 6.0 1.0 1.0 22.0 16.0 2.0 155.0 33.0 15.0 56.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5568 xique-xique BA -42.725508 -10.824974 406.26 5079.66 13330.49 342261.06 45536.0 518.0 2318.0 3095.0 3620.0 19371.0 3575.0 0.585 169.0 420.0 2.0 1.0 23.0 0.0 1.0 10.0 299.0 6.0 11.0 2.0 4.0 0.0 5.0 9.0 2.0 6.0 19.0 3.0 17.0 0.0 0.0
5569 zabelê PB -37.093552 -8.076874 646.34 109.39 635.10 19582.06 2075.0 20.0 110.0 140.0 121.0 874.0 204.0 0.623 2.0 20.0 0.0 0.0 2.0 0.0 0.0 1.0 8.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 2.0 1.0 0.0 0.0 5.0 0.0 0.0
5570 zacarias SP -50.055740 -21.050110 415.85 319.06 1988.33 75934.89 2335.0 21.0 104.0 123.0 127.0 1165.0 296.0 0.730 155.0 111.0 65.0 0.0 4.0 0.0 0.0 0.0 17.0 4.0 5.0 0.0 0.0 0.0 1.0 1.0 3.0 1.0 1.0 2.0 7.0 0.0 0.0
5571 zé doca MA -45.657698 -3.275481 35.66 2140.11 22610.35 380509.30 50173.0 541.0 2366.0 3087.0 3505.0 18136.0 3168.0 0.595 6493.0 470.0 3.0 0.0 14.0 0.0 0.0 17.0 334.0 5.0 13.0 4.0 0.0 2.0 6.0 9.0 2.0 11.0 9.0 3.0 38.0 0.0 0.0
5572 zortéa SC -51.549566 -27.450251 685.30 190.18 3014.65 66998.29 2991.0 31.0 132.0 211.0 190.0 1506.0 256.0 0.760 68.0 76.0 0.0 0.0 7.0 2.0 0.0 3.0 45.0 5.0 2.0 0.0 0.0 0.0 1.0 3.0 2.0 1.0 1.0 1.0 3.0 0.0 0.0

5573 rows × 39 columns

Cleaning Category Data

The category data also need modification in column names. Similarly, we use replace() to make the column names cleaner. Once we have cleaned up the column names in our dataset, we can use them for joins with other datasets or continue with our analysis. It is generally a good idea to ensure that our column names are straightforward and informative, as this can help make our research more efficient and effective.

In [ ]:
category_data
Out[ ]:
product_category_name product_category_name_english
0 beleza_saude health_beauty
1 informatica_acessorios computers_accessories
2 automotivo auto
3 cama_mesa_banho bed_bath_table
4 moveis_decoracao furniture_decor
... ... ...
66 flores flowers
67 artes_e_artesanato arts_and_craftmanship
68 fraldas_higiene diapers_and_hygiene
69 fashion_roupa_infanto_juvenil fashion_childrens_clothes
70 seguros_e_servicos security_and_services

71 rows × 2 columns

In [ ]:
category_data['product_category_name'] = category_data['product_category_name'].str.replace('_', ' ')
In [ ]:
category_data['product_category_name_english'] = category_data['product_category_name_english'].str.replace('_', ' ')
In [ ]:
product_data
Out[ ]:
product_id product_category_name product_name_length product_description_length product_photos_qty product_weight_g product_length_cm product_height_cm product_width_cm
0 1e9e8ef04dbcff4541ed26657ea517e5 perfumaria 40.0 287.0 1.0 225.0 16.0 10.0 14.0
1 3aa071139cb16b67ca9e5dea641aaa2f artes 44.0 276.0 1.0 1000.0 30.0 18.0 20.0
2 96bd76ec8810374ed1b65e291975717f esporte_lazer 46.0 250.0 1.0 154.0 18.0 9.0 15.0
3 cef67bcfe19066a932b7673e239eb23d bebes 27.0 261.0 1.0 371.0 26.0 4.0 26.0
4 9dc1a7de274444849c219cff195d0b71 utilidades_domesticas 37.0 402.0 4.0 625.0 20.0 17.0 13.0
... ... ... ... ... ... ... ... ... ...
32946 a0b7d5a992ccda646f2d34e418fff5a0 moveis_decoracao 45.0 67.0 2.0 12300.0 40.0 40.0 40.0
32947 bf4538d88321d0fd4412a93c974510e6 construcao_ferramentas_iluminacao 41.0 971.0 1.0 1700.0 16.0 19.0 16.0
32948 9a7c6041fa9592d9d9ef6cfe62a71f8c cama_mesa_banho 50.0 799.0 1.0 1400.0 27.0 7.0 27.0
32949 83808703fc0706a22e264b9d75f04a2e informatica_acessorios 60.0 156.0 2.0 700.0 31.0 13.0 20.0
32950 106392145fca363410d287a815be6de4 cama_mesa_banho 58.0 309.0 1.0 2083.0 12.0 2.0 7.0

32340 rows × 9 columns

Dataset Summary

We want an overview of our dataset, including the column names and relevant information about each column, so we use the .info() and .describe() methods in a Python program.

The .info() method will provide a summary of the data frame, including the number of rows and columns, the data type for each column, and the number of non-null values. This info can help identify any missing or incomplete data in our dataset and understand the general structure of the data.

On the other hand, the .describe() method will provide more detailed statistics about the numeric columns in our dataset. This method can include measures such as the mean, standard deviation, minimum, and maximum values for each numeric column and the number of unique and missing values. This way can help us understand the distribution and range of values in our dataset and can help us identify any potential outliers or anomalies in the data.

Using these methods, we can get a valuable overview of the cleaned dataset, help with our plan, and prepare for further analysis.

In [ ]:
customer_data.to_csv('customer.csv')
!cp data.csv "drive/My Drive/"
cp: cannot stat 'data.csv': No such file or directory
In [ ]:
from google.colab import files
customer_data.to_csv('customer.csv') 
files.download('customer.csv')
In [ ]:
customer_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 99441 entries, 0 to 99440
Data columns (total 5 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   customer_id               99441 non-null  object
 1   customer_unique_id        99441 non-null  object
 2   customer_zip_code_prefix  99441 non-null  int64 
 3   customer_city             99441 non-null  object
 4   customer_state            99441 non-null  object
dtypes: int64(1), object(4)
memory usage: 4.6+ MB
In [ ]:
order_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 96461 entries, 0 to 99440
Data columns (total 23 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   order_id                       96461 non-null  object
 1   customer_id                    96461 non-null  object
 2   order_status                   96461 non-null  object
 3   order_purchase_timestamp       96461 non-null  object
 4   order_approved_at              96461 non-null  object
 5   order_delivered_carrier_date   96461 non-null  object
 6   order_delivered_customer_date  96461 non-null  object
 7   order_estimated_delivery_date  96461 non-null  object
 8   order_purchase_year            96461 non-null  int64 
 9   order_purchase_month           96461 non-null  int64 
 10  order_purchase_day             96461 non-null  int64 
 11  order_approve_year             96461 non-null  int64 
 12  order_approve_month            96461 non-null  int64 
 13  order_approve_day              96461 non-null  int64 
 14  order_deliver_carrier_year     96461 non-null  int64 
 15  order_deliver_carrier_month    96461 non-null  int64 
 16  order_deliver_carrier_day      96461 non-null  int64 
 17  order_deliver_customer_year    96461 non-null  int64 
 18  order_deliver_customer_month   96461 non-null  int64 
 19  order_deliver_customer_day     96461 non-null  int64 
 20  order_estimate_delivery_year   96461 non-null  int64 
 21  order_estimate_delivery_month  96461 non-null  int64 
 22  order_estimate_delivery_day    96461 non-null  int64 
dtypes: int64(15), object(8)
memory usage: 17.7+ MB
In [ ]:
from google.colab import files
order_data.to_csv('order.csv') 
files.download('order.csv')
In [ ]:
items_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112650 entries, 0 to 112649
Data columns (total 7 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   order_id             112650 non-null  object 
 1   order_item_id        112650 non-null  int64  
 2   product_id           112650 non-null  object 
 3   seller_id            112650 non-null  object 
 4   shipping_limit_date  112650 non-null  object 
 5   price                112650 non-null  float64
 6   freight_value        112650 non-null  float64
dtypes: float64(2), int64(1), object(4)
memory usage: 6.0+ MB
In [ ]:
from google.colab import files
items_data.to_csv('items.csv') 
files.download('items.csv')
In [ ]:
product_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 32340 entries, 0 to 32950
Data columns (total 9 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   product_id                  32340 non-null  object 
 1   product_category_name       32340 non-null  object 
 2   product_name_length         32340 non-null  float64
 3   product_description_length  32340 non-null  float64
 4   product_photos_qty          32340 non-null  float64
 5   product_weight_g            32340 non-null  float64
 6   product_length_cm           32340 non-null  float64
 7   product_height_cm           32340 non-null  float64
 8   product_width_cm            32340 non-null  float64
dtypes: float64(7), object(2)
memory usage: 2.5+ MB
In [ ]:
from google.colab import files
product_data.to_csv('product.csv') 
files.download('product.csv')
In [ ]:
city_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5573 entries, 0 to 5572
Data columns (total 39 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   city            5573 non-null   object 
 1   state           5573 non-null   object 
 2   longitude       5564 non-null   float64
 3   latitude        5564 non-null   float64
 4   altitude        5564 non-null   float64
 5   area            5570 non-null   float64
 6   taxes           5570 non-null   float64
 7   gdp             5570 non-null   float64
 8   res_population  5565 non-null   float64
 9   pop_below1      5565 non-null   float64
 10  pop_1_4         5565 non-null   float64
 11  pop_5_9         5565 non-null   float64
 12  pop_10_14       5565 non-null   float64
 13  pop_15_59       5565 non-null   float64
 14  pop_60          5565 non-null   float64
 15  HDI             5565 non-null   float64
 16  pay_TV          5570 non-null   float64
 17  COMP_TOT        5570 non-null   float64
 18  COMP_A          5570 non-null   float64
 19  COMP_B          5570 non-null   float64
 20  COMP_C          5570 non-null   float64
 21  COMP_D          5570 non-null   float64
 22  COMP_E          5570 non-null   float64
 23  COMP_F          5570 non-null   float64
 24  COMP_G          5570 non-null   float64
 25  COMP_H          5570 non-null   float64
 26  COMP_I          5570 non-null   float64
 27  COMP_J          5570 non-null   float64
 28  COMP_K          5570 non-null   float64
 29  COMP_L          5570 non-null   float64
 30  COMP_M          5570 non-null   float64
 31  COMP_N          5570 non-null   float64
 32  COMP_O          5570 non-null   float64
 33  COMP_P          5570 non-null   float64
 34  COMP_Q          5570 non-null   float64
 35  COMP_R          5570 non-null   float64
 36  COMP_S          5570 non-null   float64
 37  COMP_T          5570 non-null   float64
 38  COMP_U          5570 non-null   float64
dtypes: float64(37), object(2)
memory usage: 1.7+ MB
In [ ]:
from google.colab import files
city_data.to_csv('city.csv') 
files.download('city.csv')
In [ ]:
customer_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 99441 entries, 0 to 99440
Data columns (total 5 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   customer_id               99441 non-null  object
 1   customer_unique_id        99441 non-null  object
 2   customer_zip_code_prefix  99441 non-null  int64 
 3   customer_city             99441 non-null  object
 4   customer_state            99441 non-null  object
dtypes: int64(1), object(4)
memory usage: 4.6+ MB
In [ ]:
from google.colab import files
customer_data.to_csv('customer.csv') 
files.download('customer.csv')
In [ ]:
geolocation_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000163 entries, 0 to 1000162
Data columns (total 5 columns):
 #   Column                       Non-Null Count    Dtype  
---  ------                       --------------    -----  
 0   geolocation_zip_code_prefix  1000163 non-null  int64  
 1   geolocation_lat              1000163 non-null  float64
 2   geolocation_lng              1000163 non-null  float64
 3   geolocation_city             1000163 non-null  object 
 4   geolocation_state            1000163 non-null  object 
dtypes: float64(2), int64(1), object(2)
memory usage: 45.8+ MB
In [ ]:
from google.colab import files
geolocation_data.to_csv('geolocation.csv') 
files.download('geolocation.csv')
In [ ]:
payment_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 103886 entries, 0 to 103885
Data columns (total 5 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   order_id              103886 non-null  object 
 1   payment_sequential    103886 non-null  int64  
 2   payment_type          103886 non-null  object 
 3   payment_installments  103886 non-null  int64  
 4   payment_value         103886 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 4.8+ MB
In [ ]:
from google.colab import files
payment_data.to_csv('payment.csv') 
files.download('payment.csv')
In [ ]:
review_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9839 entries, 9 to 99200
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   review_id                9839 non-null   object
 1   order_id                 9839 non-null   object
 2   review_score             9839 non-null   int64 
 3   review_comment_title     9839 non-null   object
 4   review_comment_message   9839 non-null   object
 5   review_creation_date     9839 non-null   object
 6   review_answer_timestamp  9839 non-null   object
dtypes: int64(1), object(6)
memory usage: 614.9+ KB
In [ ]:
from google.colab import files
review_data.to_csv('review.csv') 
files.download('review.csv')
In [ ]:
seller_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3095 entries, 0 to 3094
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   seller_id               3095 non-null   object
 1   seller_zip_code_prefix  3095 non-null   int64 
 2   seller_city             3095 non-null   object
 3   seller_state            3095 non-null   object
dtypes: int64(1), object(3)
memory usage: 120.9+ KB
In [ ]:
from google.colab import files
seller_data.to_csv('seller.csv') 
files.download('seller.csv')
In [ ]:
category_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 71 entries, 0 to 70
Data columns (total 2 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   product_category_name          71 non-null     object
 1   product_category_name_english  71 non-null     object
dtypes: object(2)
memory usage: 1.7+ KB
In [ ]:
from google.colab import files
category_data.to_csv('category.csv') 
files.download('category.csv')
In [ ]:
order_items_data.agg(
    {
        "price": ["min", "max", "median", "mean", "std", "skew"],
        "freight_value": ["min", "max", "median", "mean", "std", "skew"],
    }
)
Out[ ]:
price freight_value
min 0.850000 0.000000
max 6735.000000 409.680000
median 74.990000 16.260000
mean 120.653739 19.990320
std 183.633928 15.806405
skew 7.923208 5.639870
In [ ]:
product_data.agg(
    {
        "product_name_length": ["min", "max", "median", "mean", "std", "skew"],
        "product_description_length": ["min", "max", "median", "mean", "std", "skew"],
        "product_photos_qty": ["min", "max", "median", "mean", "std", "skew"],
        "product_weight_g": ["min", "max", "median", "mean", "std", "skew"],
        "product_length_cm": ["min", "max", "median", "mean", "std", "skew"],
        "product_height_cm": ["min", "max", "median", "mean", "std", "skew"],
        "product_width_cm": ["min", "max", "median", "mean", "std", "skew"],
    }
)
Out[ ]:
product_name_length product_description_length product_photos_qty product_weight_g product_length_cm product_height_cm product_width_cm
min 5.000000 4.000000 1.000000 0.000000 7.000000 2.000000 6.000000
max 76.000000 3992.000000 20.000000 40425.000000 105.000000 105.000000 118.000000
median 51.000000 595.000000 1.000000 700.000000 25.000000 13.000000 20.000000
mean 48.476592 771.492393 2.188961 2276.956586 30.854545 16.958813 23.208596
std 10.245699 635.124831 1.736787 4279.291845 16.955965 13.636115 12.078762
skew -0.903200 1.962078 2.193438 3.607632 1.750295 2.148907 1.678041

Merging Datasets

The product and product category tables are strongly related to each other. We use the pandas' library and the merge() method to merge two datasets in Python. This method allows us to combine the data from two separate datasets into a single data frame by matching the values in one or more common columns.

After merging, the resulting data frame will include all of the columns from both tables and any additional columns created as a result of the merge.

Once we have merged the two datasets, we can use the resulting data frame for further analysis or generate reports or visualizations. This method can provide a complete view of our data and help us understand the relationships between products and product categories.

Finally, we gain an overview of each product info like the ids, volumes, as well as their categories.

In [ ]:
product_data
Out[ ]:
product_id product_category_name product_name_length product_description_length product_photos_qty product_weight_g product_length_cm product_height_cm product_width_cm
0 1e9e8ef04dbcff4541ed26657ea517e5 perfumaria 40.0 287.0 1.0 225.0 16.0 10.0 14.0
1 3aa071139cb16b67ca9e5dea641aaa2f artes 44.0 276.0 1.0 1000.0 30.0 18.0 20.0
2 96bd76ec8810374ed1b65e291975717f esporte_lazer 46.0 250.0 1.0 154.0 18.0 9.0 15.0
3 cef67bcfe19066a932b7673e239eb23d bebes 27.0 261.0 1.0 371.0 26.0 4.0 26.0
4 9dc1a7de274444849c219cff195d0b71 utilidades_domesticas 37.0 402.0 4.0 625.0 20.0 17.0 13.0
... ... ... ... ... ... ... ... ... ...
32946 a0b7d5a992ccda646f2d34e418fff5a0 moveis_decoracao 45.0 67.0 2.0 12300.0 40.0 40.0 40.0
32947 bf4538d88321d0fd4412a93c974510e6 construcao_ferramentas_iluminacao 41.0 971.0 1.0 1700.0 16.0 19.0 16.0
32948 9a7c6041fa9592d9d9ef6cfe62a71f8c cama_mesa_banho 50.0 799.0 1.0 1400.0 27.0 7.0 27.0
32949 83808703fc0706a22e264b9d75f04a2e informatica_acessorios 60.0 156.0 2.0 700.0 31.0 13.0 20.0
32950 106392145fca363410d287a815be6de4 cama_mesa_banho 58.0 309.0 1.0 2083.0 12.0 2.0 7.0

32340 rows × 9 columns

In [ ]:
category_data
Out[ ]:
product_category_name product_category_name_english
0 beleza saude health beauty
1 informatica acessorios computers accessories
2 automotivo auto
3 cama mesa banho bed bath table
4 moveis decoracao furniture decor
... ... ...
66 flores flowers
67 artes e artesanato arts and craftmanship
68 fraldas higiene diapers and hygiene
69 fashion roupa infanto juvenil fashion childrens clothes
70 seguros e servicos security and services

71 rows × 2 columns

In [ ]:
import sqlite3
import pandasql as ps #SQL on Pandas Dataframe
In [ ]:
# product_id, product_category_name, product_name_length, product_description_length, product_photos_qty
# product_weight_g, product_length_cm, product_height_cm, product_width_cm, product_category_name_english
query = """
SELECT product_id, product_name_length, product_description_length, product_photos_qty,
product_weight_g, product_length_cm, product_height_cm, product_width_cm,category_data.product_category_name_english
FROM product_data
JOIN category_data
ON product_data.product_category_name = category_data.product_category_name
"""
cleaned_product_data= ps.sqldf(query, locals())
cleaned_product_data
Out[ ]:
product_id product_name_length product_description_length product_photos_qty product_weight_g product_length_cm product_height_cm product_width_cm product_category_name_english
0 1e9e8ef04dbcff4541ed26657ea517e5 40.0 287.0 1.0 225.0 16.0 10.0 14.0 perfumery
1 3aa071139cb16b67ca9e5dea641aaa2f 44.0 276.0 1.0 1000.0 30.0 18.0 20.0 art
2 cef67bcfe19066a932b7673e239eb23d 27.0 261.0 1.0 371.0 26.0 4.0 26.0 baby
3 37cc742be07708b53a98702e77a21a02 57.0 163.0 1.0 400.0 27.0 13.0 17.0 home appliances
4 8c92109888e8cdf9d66dc7e463025574 36.0 1156.0 1.0 600.0 17.0 10.0 12.0 toys
... ... ... ... ... ... ... ... ... ...
8664 8e5e831d742acedb1758e90e18974231 55.0 520.0 1.0 550.0 20.0 15.0 15.0 auto
8665 0bf1dea484fbb8cdfa09e2767ce30574 53.0 225.0 1.0 1920.0 37.0 6.0 32.0 food
8666 d4484cc239fbd0ac671ab04d931edc66 42.0 414.0 2.0 150.0 23.0 15.0 15.0 baby
8667 d218a47759ef0d1db44044934909b88b 50.0 778.0 1.0 350.0 17.0 11.0 11.0 telephony
8668 6ec96c91757fad0aecafc0ee7f262dcc 62.0 1417.0 1.0 9550.0 36.0 35.0 35.0 baby

8669 rows × 9 columns

API QUERIES

We use pandas SQL for API queries. For more detailed API queries can refer to our API documents.

The panda SQL library allows us to use SQL-like syntax to query and manipulate data frames in Python. This method can be helpful if we are more familiar with SQL or want to use SQL-like queries to work more efficiently and intuitively with our data.

To use panda SQL, we first need to import the sqldf function from the library, which we have done previously. This function allows us to run SQL-like queries on our data frames using a syntax similar to standard SQL.

In addition to SELECT, FROM, and WHERE, panda SQL supports other common SQL keywords and operators, such as GROUP BY, HAVING, ORDER BY, JOIN, and more. We use these keywords and operators to perform more complex queries and manipulations on our data frames.

We use the SELECT and WHERE clauses in a SQL query to search for a specific seller using their ID. The SELECT clause is used to specify which columns we want to retrieve from the table, and the WHERE clause is used to determine the conditions the data must meet to be included in the result set. "SELECT " would return all columns (indicated by the ) for the seller with the specified ID. We can also specify specific columns in the SELECT clause to only retrieve particular fields rather than all of them.

Similarly, we use this method to filter sellers by city, by state. And search customers by city and by state. And likewise, for filtering orders by id, customer id, and status.

Filtering sellers by city by state and searching customers by city and state are standard methods to organize and manage customer and seller information in a database. This method can help analyze sales data and identify trends or patterns based on location. Filtering orders by ID, customer ID, and status can also be helpful for tracking and managing orders within a business. These methods allow for efficient retrieval and organization of information, which can be beneficial for various purposes, such as managing inventory, analyzing sales data, and providing customer support.

Search seller by id
In [ ]:
seller_data
Out[ ]:
seller_id seller_zip_code_prefix seller_city seller_state
0 3442f8959a84dea7ee197c632cb2df15 13023 campinas SP
1 d1b65fc7debc3361ea86b5f14c68d2e2 13844 mogi guacu SP
2 ce3ad9de960102d0677a81f5d0bb7b2d 20031 rio de janeiro RJ
3 c0f3eea2e14555b6faeea3dd58c1b1c3 4195 sao paulo SP
4 51a04a8a6bdcb23deccc82b0b80742cf 12914 braganca paulista SP
... ... ... ... ...
3090 98dddbc4601dd4443ca174359b237166 87111 sarandi PR
3091 f8201cab383e484733266d1906e2fdfa 88137 palhoca SC
3092 74871d19219c7d518d0090283e03c137 4650 sao paulo SP
3093 e603cf3fec55f8697c9059638d6c8eb5 96080 pelotas RS
3094 9e25199f6ef7e7c347120ff175652c3b 12051 taubate SP

3095 rows × 4 columns

In [ ]:
query1 = """
SELECT *
FROM seller_data
where seller_id = "51a04a8a6bdcb23deccc82b0b80742cf"
"""
df1= ps.sqldf(query1, locals())
df1
Out[ ]:
seller_id seller_zip_code_prefix seller_city seller_state
0 51a04a8a6bdcb23deccc82b0b80742cf 12914 braganca paulista SP
Filter seller by city
In [ ]:
query2 = """
SELECT *
FROM seller_data
where seller_city = "braganca paulista"
"""
df2= ps.sqldf(query2, locals())
df2
Out[ ]:
seller_id seller_zip_code_prefix seller_city seller_state
0 51a04a8a6bdcb23deccc82b0b80742cf 12914 braganca paulista SP
1 c7c7b2682c648339a0e9846bb1023a39 12916 braganca paulista SP
2 f44bdc12aa39f5271f0b95f66cd4c289 12914 braganca paulista SP
3 7b0df942f46435babab05d49b744b2c4 12924 braganca paulista SP
4 e24fc9fcd865784fb25705606fe3dfe7 12913 braganca paulista SP
Filter seller by state
In [ ]:
query3 = """
SELECT *
FROM seller_data
where seller_state = "SP"
"""
df3= ps.sqldf(query3, locals())
df3
Out[ ]:
seller_id seller_zip_code_prefix seller_city seller_state
0 3442f8959a84dea7ee197c632cb2df15 13023 campinas SP
1 d1b65fc7debc3361ea86b5f14c68d2e2 13844 mogi guacu SP
2 c0f3eea2e14555b6faeea3dd58c1b1c3 4195 sao paulo SP
3 51a04a8a6bdcb23deccc82b0b80742cf 12914 braganca paulista SP
4 1b938a7ec6ac5061a66a3766e0e75f90 16304 penapolis SP
... ... ... ... ...
1844 f00f5b35d0abcacbdd863672f4bb2c1a 1238 sao paulo SP
1845 7bac63f6603d382cc8d0832eb6c100a8 5713 sao paulo SP
1846 f1fdf2d13186575751aa25876536d85c 5314 sao paulo SP
1847 74871d19219c7d518d0090283e03c137 4650 sao paulo SP
1848 9e25199f6ef7e7c347120ff175652c3b 12051 taubate SP

1849 rows × 4 columns

Search Customer by id
In [ ]:
customer_data
Out[ ]:
customer_id customer_unique_id customer_zip_code_prefix customer_city customer_state
0 06b8999e2fba1a1fbc88172c00ba8bc7 861eff4711a542e4b93843c6dd7febb0 14409 franca SP
1 18955e83d337fd6b2def6b18a428ac77 290c77bc529b7ac935b93aa66c333dc3 9790 sao bernardo do campo SP
2 4e7b3e00288586ebd08712fdd0374a03 060e732b5b29e8181a18229c7b0b2b5e 1151 sao paulo SP
3 b2b6027bc5c5109e529d4dc6358b12c3 259dac757896d24d7702b9acbbff3f3c 8775 mogi das cruzes SP
4 4f2d8ab171c80ec8364f7c12e35b23ad 345ecd01c38d18a9036ed96c73b8d066 13056 campinas SP
... ... ... ... ... ...
99436 17ddf5dd5d51696bb3d7c6291687be6f 1a29b476fee25c95fbafc67c5ac95cf8 3937 sao paulo SP
99437 e7b71a9017aa05c9a7fd292d714858e8 d52a67c98be1cf6a5c84435bd38d095d 6764 taboao da serra SP
99438 5e28dfe12db7fb50a4b2f691faecea5e e9f50caf99f032f0bf3c55141f019d99 60115 fortaleza CE
99439 56b18e2166679b8a959d72dd06da27f9 73c2643a0a458b49f58cea58833b192e 92120 canoas RS
99440 274fa6071e5e17fe303b9748641082c8 84732c5050c01db9b23e19ba39899398 6703 cotia SP

99441 rows × 5 columns

In [ ]:
query4 = """
SELECT *
FROM customer_data
where customer_unique_id = "84732c5050c01db9b23e19ba39899398"
"""
df4= ps.sqldf(query4, locals())
df4
Out[ ]:
customer_id customer_unique_id customer_zip_code_prefix customer_city customer_state
0 274fa6071e5e17fe303b9748641082c8 84732c5050c01db9b23e19ba39899398 6703 cotia SP
Filter customer by city
In [ ]:
query5 = """
SELECT *
FROM customer_data
where customer_city = "taboao da serra"
"""
df5= ps.sqldf(query5, locals())
df5
Out[ ]:
customer_id customer_unique_id customer_zip_code_prefix customer_city customer_state
0 ce330ffe33e3f344ad5a80b8410c1ced 4df9aee56fa602c72756abb22ee60b17 6764 taboao da serra SP
1 302025f399c7405c509a297b6b9a1e73 7bf1c59401206e31a78523077f3a1144 6765 taboao da serra SP
2 ff4ead15d934b5b5896ca73904d07f66 2ce81600d264e61fe46c8cadffd18db5 6786 taboao da serra SP
3 ceead79d927539feb1a27e3ec80cac36 f75da962fb4ddac1569e90c21803c455 6768 taboao da serra SP
4 cb5335826b0ab39de2c0fcc6f6e3d7d1 17edcf1332227a04be5ee81262f0757e 6767 taboao da serra SP
... ... ... ... ... ...
291 db72b1bc11dfa3a7d4dc04332a77b651 ee94c6489208799271238c965bdbd1ac 6784 taboao da serra SP
292 d581cc16c2132da59f81791d5672bb5e 2aa8f2dd8f41828e0583f7aad7d166e1 6783 taboao da serra SP
293 70299e106f6e6cab75bc08777a3a4480 dd68fedcc13ecfb683373b95b4215c28 6760 taboao da serra SP
294 31916c2ef56e3981ee1aa02ce0bec4b2 5033efbf8c4c2ca17bcb3a2fe3b5b9db 6753 taboao da serra SP
295 e7b71a9017aa05c9a7fd292d714858e8 d52a67c98be1cf6a5c84435bd38d095d 6764 taboao da serra SP

296 rows × 5 columns

Filter customer by state
In [ ]:
query6 = """
SELECT *
FROM customer_data
where customer_state = "SP"
"""
df6= ps.sqldf(query6, locals())
df6
Out[ ]:
customer_id customer_unique_id customer_zip_code_prefix customer_city customer_state
0 06b8999e2fba1a1fbc88172c00ba8bc7 861eff4711a542e4b93843c6dd7febb0 14409 franca SP
1 18955e83d337fd6b2def6b18a428ac77 290c77bc529b7ac935b93aa66c333dc3 9790 sao bernardo do campo SP
2 4e7b3e00288586ebd08712fdd0374a03 060e732b5b29e8181a18229c7b0b2b5e 1151 sao paulo SP
3 b2b6027bc5c5109e529d4dc6358b12c3 259dac757896d24d7702b9acbbff3f3c 8775 mogi das cruzes SP
4 4f2d8ab171c80ec8364f7c12e35b23ad 345ecd01c38d18a9036ed96c73b8d066 13056 campinas SP
... ... ... ... ... ...
41741 f255d679c7c86c24ef4861320d5b7675 d111b06b6f3a2add0d2241325f65b5ca 13500 rio claro SP
41742 f5a0b560f9e9427792a88bec97710212 b3e53d18a997f27a3ffd16da497eaf58 7790 cajamar SP
41743 17ddf5dd5d51696bb3d7c6291687be6f 1a29b476fee25c95fbafc67c5ac95cf8 3937 sao paulo SP
41744 e7b71a9017aa05c9a7fd292d714858e8 d52a67c98be1cf6a5c84435bd38d095d 6764 taboao da serra SP
41745 274fa6071e5e17fe303b9748641082c8 84732c5050c01db9b23e19ba39899398 6703 cotia SP

41746 rows × 5 columns

Search order by id
In [ ]:
order_data
Out[ ]:
order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date order_purchase_year order_purchase_month order_purchase_day order_approve_year order_approve_month order_approve_day order_deliver_carrier_year order_deliver_carrier_month order_deliver_carrier_day order_deliver_customer_year order_deliver_customer_month order_deliver_customer_day order_estimate_delivery_year order_estimate_delivery_month order_estimate_delivery_day
0 e481f51cbdc54678b7cc49136f2d6af7 9ef432eb6251297304e76186b10a928d delivered 2017-10-02 10:56:33 2017-10-02 11:07:15 2017-10-04 19:55:00 2017-10-10 21:25:13 2017-10-18 00:00:00 2017 10 2 2017 10 2 2017 10 4 2017 10 10 2017 10 18
1 53cdb2fc8bc7dce0b6741e2150273451 b0830fb4747a6c6d20dea0b8c802d7ef delivered 2018-07-24 20:41:37 2018-07-26 03:24:27 2018-07-26 14:31:00 2018-08-07 15:27:45 2018-08-13 00:00:00 2018 7 24 2018 7 26 2018 7 26 2018 8 7 2018 8 13
2 47770eb9100c2d0c44946d9cf07ec65d 41ce2a54c0b03bf3443c3d931a367089 delivered 2018-08-08 08:38:49 2018-08-08 08:55:23 2018-08-08 13:50:00 2018-08-17 18:06:29 2018-09-04 00:00:00 2018 8 8 2018 8 8 2018 8 8 2018 8 17 2018 9 4
3 949d5b44dbf5de918fe9c16f97b45f8a f88197465ea7920adcdbec7375364d82 delivered 2017-11-18 19:28:06 2017-11-18 19:45:59 2017-11-22 13:39:59 2017-12-02 00:28:42 2017-12-15 00:00:00 2017 11 18 2017 11 18 2017 11 22 2017 12 2 2017 12 15
4 ad21c59c0840e6cb83a9ceb5573f8159 8ab97904e6daea8866dbdbc4fb7aad2c delivered 2018-02-13 21:18:39 2018-02-13 22:20:29 2018-02-14 19:46:34 2018-02-16 18:17:02 2018-02-26 00:00:00 2018 2 13 2018 2 13 2018 2 14 2018 2 16 2018 2 26
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
99436 9c5dedf39a927c1b2549525ed64a053c 39bd1228ee8140590ac3aca26f2dfe00 delivered 2017-03-09 09:54:05 2017-03-09 09:54:05 2017-03-10 11:18:03 2017-03-17 15:08:01 2017-03-28 00:00:00 2017 3 9 2017 3 9 2017 3 10 2017 3 17 2017 3 28
99437 63943bddc261676b46f01ca7ac2f7bd8 1fca14ff2861355f6e5f14306ff977a7 delivered 2018-02-06 12:58:58 2018-02-06 13:10:37 2018-02-07 23:22:42 2018-02-28 17:37:56 2018-03-02 00:00:00 2018 2 6 2018 2 6 2018 2 7 2018 2 28 2018 3 2
99438 83c1379a015df1e13d02aae0204711ab 1aa71eb042121263aafbe80c1b562c9c delivered 2017-08-27 14:46:43 2017-08-27 15:04:16 2017-08-28 20:52:26 2017-09-21 11:24:17 2017-09-27 00:00:00 2017 8 27 2017 8 27 2017 8 28 2017 9 21 2017 9 27
99439 11c177c8e97725db2631073c19f07b62 b331b74b18dc79bcdf6532d51e1637c1 delivered 2018-01-08 21:28:27 2018-01-08 21:36:21 2018-01-12 15:35:03 2018-01-25 23:32:54 2018-02-15 00:00:00 2018 1 8 2018 1 8 2018 1 12 2018 1 25 2018 2 15
99440 66dea50a8b16d9b4dee7af250b4be1a5 edb027a75a1449115f6b43211ae02a24 delivered 2018-03-08 20:57:30 2018-03-09 11:20:28 2018-03-09 22:11:59 2018-03-16 13:08:30 2018-04-03 00:00:00 2018 3 8 2018 3 9 2018 3 9 2018 3 16 2018 4 3

96461 rows × 23 columns

In [ ]:
query7 = """
SELECT *
FROM order_data
where order_id = "ad21c59c0840e6cb83a9ceb5573f8159"
"""
df7= ps.sqldf(query7, locals())
df7
Out[ ]:
order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date order_purchase_year order_purchase_month order_purchase_day order_approve_year order_approve_month order_approve_day order_deliver_carrier_year order_deliver_carrier_month order_deliver_carrier_day order_deliver_customer_year order_deliver_customer_month order_deliver_customer_day order_estimate_delivery_year order_estimate_delivery_month order_estimate_delivery_day
0 ad21c59c0840e6cb83a9ceb5573f8159 8ab97904e6daea8866dbdbc4fb7aad2c delivered 2018-02-13 21:18:39 2018-02-13 22:20:29 2018-02-14 19:46:34 2018-02-16 18:17:02 2018-02-26 00:00:00 2018 2 13 2018 2 13 2018 2 14 2018 2 16 2018 2 26
Search order by customer id
In [ ]:
query8 = """
SELECT *
FROM order_data
where customer_id = "8ab97904e6daea8866dbdbc4fb7aad2c"
"""
df8= ps.sqldf(query8, locals())
df8
Out[ ]:
order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date order_purchase_year order_purchase_month order_purchase_day order_approve_year order_approve_month order_approve_day order_deliver_carrier_year order_deliver_carrier_month order_deliver_carrier_day order_deliver_customer_year order_deliver_customer_month order_deliver_customer_day order_estimate_delivery_year order_estimate_delivery_month order_estimate_delivery_day
0 ad21c59c0840e6cb83a9ceb5573f8159 8ab97904e6daea8866dbdbc4fb7aad2c delivered 2018-02-13 21:18:39 2018-02-13 22:20:29 2018-02-14 19:46:34 2018-02-16 18:17:02 2018-02-26 00:00:00 2018 2 13 2018 2 13 2018 2 14 2018 2 16 2018 2 26
Filter order by status
In [ ]:
query9 = """
SELECT *
FROM order_data
where order_status = "canceled"
"""
df9 = ps.sqldf(query9, locals())
df9
Out[ ]:
order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date order_purchase_year order_purchase_month order_purchase_day order_approve_year order_approve_month order_approve_day order_deliver_carrier_year order_deliver_carrier_month order_deliver_carrier_day order_deliver_customer_year order_deliver_customer_month order_deliver_customer_day order_estimate_delivery_year order_estimate_delivery_month order_estimate_delivery_day
0 1950d777989f6a877539f53795b4c3c3 1bccb206de9f0f25adc6871a1bcf77b2 canceled 2018-02-19 19:48:52 2018-02-19 20:56:05 2018-02-20 19:57:13 2018-03-21 22:03:51 2018-03-09 00:00:00 2018 2 19 2018 2 19 2018 2 20 2018 3 21 2018 3 9
1 dabf2b0e35b423f94618bf965fcb7514 5cdec0bb8cbdf53ffc8fdc212cd247c6 canceled 2016-10-09 00:56:52 2016-10-09 13:36:58 2016-10-13 13:36:59 2016-10-16 14:36:59 2016-11-30 00:00:00 2016 10 9 2016 10 9 2016 10 13 2016 10 16 2016 11 30
2 770d331c84e5b214bd9dc70a10b829d0 6c57e6119369185e575b36712766b0ef canceled 2016-10-07 14:52:30 2016-10-07 15:07:10 2016-10-11 15:07:11 2016-10-14 15:07:11 2016-11-29 00:00:00 2016 10 7 2016 10 7 2016 10 11 2016 10 14 2016 11 29
3 8beb59392e21af5eb9547ae1a9938d06 bf609b5741f71697f65ce3852c5d2623 canceled 2016-10-08 20:17:50 2016-10-09 14:34:30 2016-10-14 22:45:26 2016-10-19 18:47:43 2016-11-30 00:00:00 2016 10 8 2016 10 9 2016 10 14 2016 10 19 2016 11 30
4 65d1e226dfaeb8cdc42f665422522d14 70fc57eeae292675927697fe03ad3ff5 canceled 2016-10-03 21:01:41 2016-10-04 10:18:57 2016-10-25 12:14:28 2016-11-08 10:58:34 2016-11-25 00:00:00 2016 10 3 2016 10 4 2016 10 25 2016 11 8 2016 11 25
5 2c45c33d2f9cb8ff8b1c86cc28c11c30 de4caa97afa80c8eeac2ff4c8da5b72e canceled 2016-10-09 15:39:56 2016-10-10 10:40:49 2016-10-14 10:40:50 2016-11-09 14:53:50 2016-12-08 00:00:00 2016 10 9 2016 10 10 2016 10 14 2016 11 9 2016 12 8
Filter order by time
In [ ]:
order_data
Out[ ]:
order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date order_purchase_year order_purchase_month order_purchase_day order_approve_year order_approve_month order_approve_day order_deliver_carrier_year order_deliver_carrier_month order_deliver_carrier_day order_deliver_customer_year order_deliver_customer_month order_deliver_customer_day order_estimate_delivery_year order_estimate_delivery_month order_estimate_delivery_day
0 e481f51cbdc54678b7cc49136f2d6af7 9ef432eb6251297304e76186b10a928d delivered 2017-10-02 10:56:33 2017-10-02 11:07:15 2017-10-04 19:55:00 2017-10-10 21:25:13 2017-10-18 00:00:00 2017 10 2 2017 10 2 2017 10 4 2017 10 10 2017 10 18
1 53cdb2fc8bc7dce0b6741e2150273451 b0830fb4747a6c6d20dea0b8c802d7ef delivered 2018-07-24 20:41:37 2018-07-26 03:24:27 2018-07-26 14:31:00 2018-08-07 15:27:45 2018-08-13 00:00:00 2018 7 24 2018 7 26 2018 7 26 2018 8 7 2018 8 13
2 47770eb9100c2d0c44946d9cf07ec65d 41ce2a54c0b03bf3443c3d931a367089 delivered 2018-08-08 08:38:49 2018-08-08 08:55:23 2018-08-08 13:50:00 2018-08-17 18:06:29 2018-09-04 00:00:00 2018 8 8 2018 8 8 2018 8 8 2018 8 17 2018 9 4
3 949d5b44dbf5de918fe9c16f97b45f8a f88197465ea7920adcdbec7375364d82 delivered 2017-11-18 19:28:06 2017-11-18 19:45:59 2017-11-22 13:39:59 2017-12-02 00:28:42 2017-12-15 00:00:00 2017 11 18 2017 11 18 2017 11 22 2017 12 2 2017 12 15
4 ad21c59c0840e6cb83a9ceb5573f8159 8ab97904e6daea8866dbdbc4fb7aad2c delivered 2018-02-13 21:18:39 2018-02-13 22:20:29 2018-02-14 19:46:34 2018-02-16 18:17:02 2018-02-26 00:00:00 2018 2 13 2018 2 13 2018 2 14 2018 2 16 2018 2 26
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
99436 9c5dedf39a927c1b2549525ed64a053c 39bd1228ee8140590ac3aca26f2dfe00 delivered 2017-03-09 09:54:05 2017-03-09 09:54:05 2017-03-10 11:18:03 2017-03-17 15:08:01 2017-03-28 00:00:00 2017 3 9 2017 3 9 2017 3 10 2017 3 17 2017 3 28
99437 63943bddc261676b46f01ca7ac2f7bd8 1fca14ff2861355f6e5f14306ff977a7 delivered 2018-02-06 12:58:58 2018-02-06 13:10:37 2018-02-07 23:22:42 2018-02-28 17:37:56 2018-03-02 00:00:00 2018 2 6 2018 2 6 2018 2 7 2018 2 28 2018 3 2
99438 83c1379a015df1e13d02aae0204711ab 1aa71eb042121263aafbe80c1b562c9c delivered 2017-08-27 14:46:43 2017-08-27 15:04:16 2017-08-28 20:52:26 2017-09-21 11:24:17 2017-09-27 00:00:00 2017 8 27 2017 8 27 2017 8 28 2017 9 21 2017 9 27
99439 11c177c8e97725db2631073c19f07b62 b331b74b18dc79bcdf6532d51e1637c1 delivered 2018-01-08 21:28:27 2018-01-08 21:36:21 2018-01-12 15:35:03 2018-01-25 23:32:54 2018-02-15 00:00:00 2018 1 8 2018 1 8 2018 1 12 2018 1 25 2018 2 15
99440 66dea50a8b16d9b4dee7af250b4be1a5 edb027a75a1449115f6b43211ae02a24 delivered 2018-03-08 20:57:30 2018-03-09 11:20:28 2018-03-09 22:11:59 2018-03-16 13:08:30 2018-04-03 00:00:00 2018 3 8 2018 3 9 2018 3 9 2018 3 16 2018 4 3

96461 rows × 23 columns

In [ ]:
query10 = """
SELECT *
FROM order_data
where order_purchase_year < 2018
"""
df10= ps.sqldf(query10, locals())
df10
Out[ ]:
order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date order_purchase_year order_purchase_month order_purchase_day order_approve_year order_approve_month order_approve_day order_deliver_carrier_year order_deliver_carrier_month order_deliver_carrier_day order_deliver_customer_year order_deliver_customer_month order_deliver_customer_day order_estimate_delivery_year order_estimate_delivery_month order_estimate_delivery_day
0 e481f51cbdc54678b7cc49136f2d6af7 9ef432eb6251297304e76186b10a928d delivered 2017-10-02 10:56:33 2017-10-02 11:07:15 2017-10-04 19:55:00 2017-10-10 21:25:13 2017-10-18 00:00:00 2017 10 2 2017 10 2 2017 10 4 2017 10 10 2017 10 18
1 949d5b44dbf5de918fe9c16f97b45f8a f88197465ea7920adcdbec7375364d82 delivered 2017-11-18 19:28:06 2017-11-18 19:45:59 2017-11-22 13:39:59 2017-12-02 00:28:42 2017-12-15 00:00:00 2017 11 18 2017 11 18 2017 11 22 2017 12 2 2017 12 15
2 a4591c265e18cb1dcee52889e2d8acc3 503740e9ca751ccdda7ba28e9ab8f608 delivered 2017-07-09 21:57:05 2017-07-09 22:10:13 2017-07-11 14:58:04 2017-07-26 10:57:55 2017-08-01 00:00:00 2017 7 9 2017 7 9 2017 7 11 2017 7 26 2017 8 1
3 6514b8ad8028c9f2cc2374ded245783f 9bdf08b4b3b52b5526ff42d37d47f222 delivered 2017-05-16 13:10:30 2017-05-16 13:22:11 2017-05-22 10:07:46 2017-05-26 12:55:51 2017-06-07 00:00:00 2017 5 16 2017 5 16 2017 5 22 2017 5 26 2017 6 7
4 76c6e866289321a7c93b82b54852dc33 f54a9f0e6b351c431402b8461ea51999 delivered 2017-01-23 18:29:09 2017-01-25 02:50:47 2017-01-26 14:16:31 2017-02-02 14:08:10 2017-03-06 00:00:00 2017 1 23 2017 1 25 2017 1 26 2017 2 2 2017 3 6
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
43678 9115830be804184b91f5c00f6f49f92d da2124f134f5dfbce9d06f29bdb6c308 delivered 2017-10-04 19:57:37 2017-10-04 20:07:14 2017-10-05 16:52:52 2017-10-20 20:25:45 2017-11-07 00:00:00 2017 10 4 2017 10 4 2017 10 5 2017 10 20 2017 11 7
43679 aa04ef5214580b06b10e2a378300db44 f01a6bfcc730456317e4081fe0c9940e delivered 2017-01-27 00:30:03 2017-01-27 01:05:25 2017-01-30 11:40:16 2017-02-07 13:15:25 2017-03-17 00:00:00 2017 1 27 2017 1 27 2017 1 30 2017 2 7 2017 3 17
43680 880675dff2150932f1601e1c07eadeeb 47cd45a6ac7b9fb16537df2ccffeb5ac delivered 2017-02-23 09:05:12 2017-02-23 09:15:11 2017-03-01 10:22:52 2017-03-06 11:08:08 2017-03-22 00:00:00 2017 2 23 2017 2 23 2017 3 1 2017 3 6 2017 3 22
43681 9c5dedf39a927c1b2549525ed64a053c 39bd1228ee8140590ac3aca26f2dfe00 delivered 2017-03-09 09:54:05 2017-03-09 09:54:05 2017-03-10 11:18:03 2017-03-17 15:08:01 2017-03-28 00:00:00 2017 3 9 2017 3 9 2017 3 10 2017 3 17 2017 3 28
43682 83c1379a015df1e13d02aae0204711ab 1aa71eb042121263aafbe80c1b562c9c delivered 2017-08-27 14:46:43 2017-08-27 15:04:16 2017-08-28 20:52:26 2017-09-21 11:24:17 2017-09-27 00:00:00 2017 8 27 2017 8 27 2017 8 28 2017 9 21 2017 9 27

43683 rows × 23 columns

Finally to retrive infomation related to geolocation, we use natural join for all datasets we have. so that we gain a geographical insights on the locations of customers, sellers, and products. The use of natural join in a database can be helpful for combining multiple datasets that share common attributes, such as the location of customers, sellers, and products. This allows for the retrieval of information related to geographical location, and can provide valuable insights into the distribution of customers, sellers, and products across a given region. By using natural join to combine these datasets, it is possible to gain a more comprehensive understanding of the relationships between different entities within a database. This can be useful for a variety of purposes, such as analyzing sales data, identifying trends and patterns, and making more informed decisions about how to allocate resources and manage operations.

Natural Join
In [ ]:
geolocation_data
Out[ ]:
geolocation_zip_code_prefix geolocation_lat geolocation_lng geolocation_city geolocation_state
0 1037 -23.545621 -46.639292 sao paulo SP
1 1046 -23.546081 -46.644820 sao paulo SP
2 1046 -23.546129 -46.642951 sao paulo SP
3 1041 -23.544392 -46.639499 sao paulo SP
4 1035 -23.541578 -46.641607 sao paulo SP
... ... ... ... ... ...
1000158 99950 -28.068639 -52.010705 tapejara RS
1000159 99900 -27.877125 -52.224882 getulio vargas RS
1000160 99950 -28.071855 -52.014716 tapejara RS
1000161 99980 -28.388932 -51.846871 david canabarro RS
1000162 99950 -28.070104 -52.018658 tapejara RS

1000163 rows × 5 columns

In [ ]:
query11 = """
SELECT *
FROM order_data
NATURAL JOIN items_data
NATURAL JOIN payment_data
NATURAL JOIN cleaned_product_data
NATURAL JOIN customer_data
NATURAL JOIN seller_data
NATURAL JOIN review_data
NATURAL JOIN geolocation_data
"""
df11= ps.sqldf(query11, locals())
df11

EDA

Explore Changes in Orders

EDA helps us get insights from another dimension of our datasets. An exploratory data analysis, or EDA, analyzes a dataset to summarize its main characteristics, often with visual methods. By performing an EDA, we can gain insights into the dataset's distribution, patterns, and relationships. This way can help identify potential trends, anomalies, or outliers that may require further investigation.

One way to perform an EDA is to query the dataset for specific periods, such as 2017 and 2018, and compare the results. This method can help us understand how the number of orders changed over time and identify any trends or patterns that may exist. For example, the number of orders increased significantly in 2018 compared to 2017, indicating potential growth in the business or market.

In addition to querying for specific periods, we can also use visual methods to explore the data. This query can include creating graphs, plots, or charts to visualize the data and help us better understand its characteristics.

Overall, an EDA can provide valuable insights into a dataset and help us identify trends, patterns, and relationships that may not be immediately apparent. Using a combination of querying and visual methods, we can gain a deeper understanding of our data and use it to inform analysis and decision-making.

In [ ]:
cleaned_product_data
In [ ]:
items_data
In [ ]:
order_data
In [ ]:
number_of_orders = order_data[(order_data['order_purchase_year'] == 2017)|\
                              (order_data['order_purchase_year'] == 2018)].count()
number_of_orders
In [ ]:
no_orders_2017 = order_data[(order_data['order_purchase_year'] == 2017)].count()
no_orders_2018 = order_data[(order_data['order_purchase_year'] == 2018)].count()
increment = ((no_orders_2018 - no_orders_2017) / no_orders_2017 * 100).round(2)
increment
Explore Changes in Sales and Payments

In addition to exploring changes in the number of orders, an EDA can also be used to examine changes in sales or the total amount of money generated by the orders. This way can help understand the growth or performance of the business over time.

To explore changes in sales between 2017 and 2018, we can query the dataset to retrieve the total sales for each year. We can then compare the results to determine the percentage increase or decrease in sales between the two years.

Once we have calculated the percentage change in sales, we can use visual methods to explore the data further. This info can include creating a graph or chart to show the difference in sales over time or comparing the sales data for different years using a bar chart or line graph. This way can help us identify any trends or patterns in the data and gain insights into the growth or performance of the business.

Overall, an EDA can be a powerful tool for exploring changes in sales over time and can provide valuable insights into the performance and growth of a business.

In [ ]:
payment_data
In [ ]:
payment_order = payment_data.merge(order_data, how='outer', on='order_id')
total_sales = payment_order.loc[(payment_order['order_purchase_year'] == 2017) |\
                                (payment_order['order_purchase_year'] == 2018), 'payment_value'].sum().round(2)
total_sales
In [ ]:
sales_2017 = payment_order.loc[(payment_order['order_purchase_year'] == 2017), 'payment_value'].sum().round(2)
sales_2018 = payment_order.loc[(payment_order['order_purchase_year'] == 2018), 'payment_value'].sum().round(2)
increment = ((sales_2018 - sales_2017) / sales_2017 * 100).round(2)
increment
Explore Changes in Customer Numbers

In addition to exploring changes in the number of orders and sales, an EDA can also be used to examine changes in the number of customers over time. This way can help understand the business's growth or performance and identify potential trends or patterns in customer behavior.

To explore changes in the number of customers between 2017 and 2018, we can query the dataset to retrieve the total number of customers for each year. We can then compare the results to determine the percentage increase or decrease in the number of customers between the two years.

In addition to examining changes in the number of customers over time, we can also use an EDA to explore the geographic distribution of customers. This way can help identify potential trends or patterns in customer behavior and understand the potential market for the business. To do this, we can query the dataset to retrieve the state or country of each customer. This method can help us understand the geographic reach of the business and identify any areas that may be potential growth markets.

Overall, an EDA can provide valuable insights into changes in the number of customers over time and the geographic distribution of customers.

In [ ]:
customer_data
In [ ]:
customer_order = customer_data.merge(order_data, how='left', on='customer_id')
number_of_customers = customer_order[(customer_order['order_purchase_year'] == 2017)|\
                              (customer_order['order_purchase_year'] == 2018)].nunique()
number_of_customers
In [ ]:
no_cust_2017 = customer_order[(customer_order['order_purchase_year'] == 2017)].count()
no_cust_2018 = customer_order[(customer_order['order_purchase_year'] == 2018)].count()
increment = ((no_cust_2018 - no_cust_2017) / no_cust_2017 * 100).round(2)
increment
In [ ]:
states = customer_data['customer_state'].nunique()
states
Explore Review Scores

An EDA can also be used to explore the review scores of products and to identify trends or patterns in customer feedback. This tool can help understand the quality and performance of the products and identify potential areas for improvement.

To explore the review scores of products, we can query the dataset to retrieve the review scores for each product. We can then calculate the percentage of products with a review score of five, which is often considered the highest rating. This way can help us understand customers' overall satisfaction with the products and identify any particularly popular or well-received products.

In addition to exploring the review scores of individual products, we can also use an EDA to identify related products and gain insights into potential market trends. This info can be useful for understanding the products that customers are interested in and for identifying potential areas for growth or expansion. To do this, we can query the dataset to retrieve the most frequently purchased products,

Overall, an EDA can provide valuable insights into the review scores and related products in a dataset and help us understand the performance and quality of the products, as well as potential market trends.

In [ ]:
review_data
In [ ]:
number_of_5_review = review_data[review_data['review_score'] == 5].count()
number_of_5_review
In [ ]:
percentage = (number_of_5_review / review_data['review_score'].count() * 100).round(2)
percentage

In addition to exploring the review scores and related products, an EDA can also be used to examine the sales of different product categories. This method can help understand the popularity and performance of different types of products and identify potential trends or patterns in customer behavior.

To explore the sales of different product categories, we can use a group-by method to group the sales data by type. This tool will allow us to calculate the total sales for each category and compare the results to determine which classes are the most popular.

In addition to exploring the sales of different product categories, we can also use an EDA to examine the sales trends over time. This way can help understand the business's growth or performance and identify potential expansion or improvement opportunities. To do this, we can query the dataset to retrieve the sales data for different periods, such as monthly or quarterly sales.

Overall, an EDA can provide valuable insights into the sales of different product categories and help us understand the popularity and performance of different types of products.

In [ ]:
sales_of_product = payment_order.merge(items_data, on='order_id').merge(cleaned_product_data, on='product_id')\
[['order_id', 'order_purchase_year', 'product_category_name_english', 'payment_value']]\
.groupby(['product_category_name_english', 'order_purchase_year']).sum()
sales_of_product
In [ ]:
cleaned_product_data.groupby('product_category_name_english')['product_category_name_english'].count().sort_values(ascending = False).head(10)
In [ ]:
payment_order.sort_values(by='order_purchase_timestamp')[['order_id', 'order_purchase_timestamp', 'payment_value']]
Explore City Data

In addition to exploring the sales of different product categories, an EDA can also be used to examine the attributes of other cities, such as GDP, residential population, and HDI. This way can help understand different cities' purchasing potential and economic conditions and identify potential trends or patterns in city data.

To explore the attributes of different cities, you can query the dataset to retrieve the relevant data, such as GDP, residential population, and HDI. You can then use sorting or grouping methods to organize the data and compare the results to identify which cities have the highest values for the selected attributes.

Overall, an EDA can provide valuable insights into the attributes of different cities and help you understand other regions' economic conditions and purchasing potential.

In [ ]:
city_data[['city', 'state', 'gdp']].sort_values('gdp', ascending=False).head(10)
In [ ]:
city_data[['gdp', 'res_population', 'HDI', 'COMP_TOT', 'pay_TV']].describe()

Summary

Overall, the Brazilian E-commerce and the Brazilian Cities datasets are rich with information and provide a wealth of opportunities for exploration and analysis. Although some cleaning and column renaming may be necessary, the datasets are generally well-structured. They contain a wide range of data that can be used to gain insights into the e-commerce market in Brazil and the economic conditions and purchasing potential of different cities.

Pandas and Pandas SQL are decisive for the data cleaning processes. By using exploratory data analysis (EDA) techniques, we can better understand the data and identify potential trends, patterns, and relationships within the datasets.

In addition to performing an EDA, we will use the data to create visualizations and web applications or interactive tools that allow users to explore the data and gain insights on their own. Our data will be more accessible and understandable in web applications, which will help others to understand and use the data to inform their analysis and decision-making.